(Binary file additions: new PNG and JPG images under doc/component-operation-guide/source/_static/images/ in the en-us_image_0000001295xxxxxx, 0000001296xxxxxx, 0000001348xxxxxx, and 0000001349xxxxxx series, each added as a new file with mode 100644; "Binary files /dev/null and b/<path> differ".)
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089893.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089893.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089893.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089897.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089897.png new file mode 100644 index 0000000..51f1f16 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089897.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089901.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089901.png new file mode 100644 index 0000000..88c4617 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089901.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089905.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089905.png new file mode 100644 index 0000000..88c4617 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089905.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089981.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089981.jpg new file mode 100644 index 0000000..1b7bea9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349089981.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090017.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090017.jpg new file mode 100644 index 0000000..10c8fff Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090017.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090021.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090021.png new file mode 100644 index 0000000..337cb75 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090021.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090029.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090029.png new file mode 100644 index 0000000..7673582 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090029.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090041.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090041.png new file mode 100644 index 0000000..fe5f15c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090041.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090061.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090061.png new file mode 100644 index 0000000..0cc2cb1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090061.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090113.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090113.png new file mode 100644 index 0000000..ce2ad69 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090113.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090137.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090137.jpg new file mode 100644 index 0000000..d780caa Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090137.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090165.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090165.png new file mode 100644 index 0000000..50c9af8 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090165.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090229.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090229.png new file mode 100644 index 0000000..1a53295 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090229.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090241.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090241.png new file mode 100644 index 0000000..4f2e171 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090241.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090245.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090245.png new file mode 100644 index 0000000..1848745 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090245.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090297.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090297.jpg new file mode 100644 index 0000000..fdba4bb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090297.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090305.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090305.jpg new file mode 100644 index 0000000..954285a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090305.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090333.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090333.png new file mode 100644 index 0000000..cfb29d3 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090333.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090345.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090345.png new file mode 100644 index 0000000..2d147d0 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090345.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090349.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090349.png new file mode 100644 index 0000000..f1f9114 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090349.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090353.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090353.png new file mode 100644 index 0000000..1d361a0 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090353.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090373.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090373.png new file mode 100644 index 0000000..1a53295 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090373.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090381.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090381.png new file mode 100644 index 0000000..e5f282b Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090381.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090385.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090385.png new file mode 100644 index 0000000..08417f1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090385.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090389.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090389.png new file mode 100644 index 0000000..ba818e1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090389.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090393.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090393.png new file mode 100644 index 0000000..6a6b344 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090393.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090429.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090429.png new file mode 100644 index 0000000..5712e6f Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090429.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090445.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090445.png new file mode 100644 index 0000000..ccfc82d Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090445.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090457.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090457.png new file mode 100644 index 0000000..1f18dec Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090457.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090473.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090473.jpg new file mode 100644 index 0000000..ab5f657 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090473.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090497.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090497.png new file mode 100644 index 0000000..5fb38e8 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349090497.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169781.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169781.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169781.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169785.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169785.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169785.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169789.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169789.png new file mode 100644 index 0000000..51f1f16 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169789.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169793.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169793.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169793.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169797.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169797.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169797.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169801.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169801.png new file mode 100644 index 0000000..e433406 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169801.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169805.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169805.png new file mode 100644 index 0000000..e433406 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169805.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169809.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169809.png new file mode 100644 index 0000000..88c4617 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169809.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169825.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169825.png new file mode 100644 index 0000000..ae151b4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169825.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169829.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169829.jpg new file mode 100644 index 0000000..f2efd5d Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169829.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169853.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169853.png new file mode 100644 index 0000000..c0966c1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169853.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169857.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169857.jpg new file mode 100644 index 0000000..59be1dd Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169857.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169861.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169861.png new file mode 100644 index 0000000..3802fd5 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169861.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169877.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169877.jpg new file mode 100644 index 0000000..ab5f657 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169877.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169933.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169933.png new file mode 100644 index 0000000..595e6c9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169933.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169941.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169941.png new file mode 100644 index 0000000..4d6a9c6 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169941.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169945.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169945.png new file mode 100644 index 0000000..de71c78 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169945.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169981.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169981.png new file mode 100644 index 0000000..2f0ab66 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349169981.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170061.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170061.png new file mode 100644 index 0000000..89241d2 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170061.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170097.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170097.jpg new file mode 100644 index 0000000..a953036 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170097.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170105.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170105.png new file mode 100644 index 0000000..8e8bb98 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170105.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170125.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170125.jpg new file mode 100644 index 0000000..59be1dd Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170125.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170129.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170129.jpg new file mode 100644 index 0000000..4988b17 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170129.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170133.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170133.jpg new file mode 100644 index 0000000..6044c06 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170133.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170145.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170145.png new file mode 100644 index 0000000..4f85c09 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170145.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170149.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170149.png new file mode 100644 index 0000000..51f1f16 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170149.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170153.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170153.png new file mode 100644 index 0000000..18bd857 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170153.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170201.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170201.jpg new file mode 100644 index 0000000..8db1bfb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170201.jpg differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170225.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170225.jpg new file mode 100644 index 0000000..76439ca Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170225.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170229.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170229.png new file mode 100644 index 0000000..e8313fc Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170229.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170237.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170237.png new file mode 100644 index 0000000..ffb212d Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170237.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170249.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170249.png new file mode 100644 index 0000000..1c171d9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170249.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170269.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170269.png new file mode 100644 index 0000000..a95b38d Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170269.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170277.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170277.png new file mode 100644 index 0000000..1a53295 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170277.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170281.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170281.png new file mode 100644 index 0000000..a400d0c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170281.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170285.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170285.jpg new file mode 100644 index 0000000..59be1dd Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170285.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170289.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170289.png new file mode 100644 index 0000000..d7daddd Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170289.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170305.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170305.png new file mode 100644 index 0000000..13d30ea Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170305.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170313.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170313.png new file mode 100644 index 0000000..08adb71 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170313.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170329.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170329.png new file mode 100644 index 0000000..ab41ac2 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170329.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170337.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170337.png new file mode 100644 index 0000000..a976086 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170337.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170353.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170353.png new file mode 100644 index 0000000..0fdd6f9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170353.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170365.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170365.png new file mode 100644 index 0000000..f832929 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170365.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170393.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170393.jpg new file mode 100644 index 0000000..ce88e71 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170393.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170953.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170953.png new file mode 100644 index 0000000..fc479f2 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349170953.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289353.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289353.png new file mode 100644 index 0000000..51f1f16 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289353.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289357.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289357.png new file mode 100644 index 0000000..51f1f16 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289357.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289361.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289361.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289361.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289365.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289365.png new file mode 100644 index 0000000..7dd8f2a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289365.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289369.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289369.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289369.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289373.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289373.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289373.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289377.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289377.png new file mode 100644 index 0000000..8506e84 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289377.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289401.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289401.png new file mode 100644 index 0000000..3604e23 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289401.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289417.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289417.png new file mode 100644 index 0000000..b87dfd2 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289417.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289421.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289421.png new file mode 100644 index 0000000..3e27e30 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289421.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289425.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289425.jpg new file mode 100644 index 0000000..311483e Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289425.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289429.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289429.jpg new file mode 100644 index 0000000..cb3b7a7 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289429.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289433.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289433.png new file mode 100644 index 0000000..fed9d11 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289433.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289449.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289449.jpg new file mode 100644 index 0000000..4f2665c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289449.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289453.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289453.png new file mode 100644 index 0000000..5712e6f Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289453.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289481.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289481.jpg new file mode 100644 index 0000000..c157e54 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289481.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289501.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289501.png new file mode 100644 index 0000000..934379c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289501.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289509.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289509.png new file mode 100644 index 0000000..645ae12 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289509.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289521.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289521.png new file mode 100644 index 0000000..afcb8e4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289521.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289525.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289525.png new file mode 100644 index 0000000..afcb8e4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289525.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289573.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289573.png new file mode 100644 index 0000000..b475254 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289573.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289589.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289589.png new file mode 100644 index 0000000..bca1dde Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289589.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289609.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289609.png new file mode 100644 index 0000000..5712e6f Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289609.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289613.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289613.jpg new file mode 100644 index 0000000..4f3a02a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289613.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289617.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289617.png new file mode 100644 index 0000000..92c06a0 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289617.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289681.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289681.png new file mode 100644 index 0000000..f2b1c54 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289681.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289709.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289709.png new file mode 100644 index 0000000..a400d0c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289709.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289713.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289713.jpg new file mode 100644 index 0000000..d3b2d60 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289713.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289717.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289717.jpg new file mode 100644 index 0000000..4465e9a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289717.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289777.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289777.png new file mode 100644 index 0000000..ce2ad69 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289777.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289781.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289781.png new file mode 100644 index 0000000..1c171d9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289781.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289813.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289813.png new file mode 100644 index 0000000..d147ee6 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289813.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289821.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289821.png new file mode 100644 index 0000000..9c631cb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289821.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289833.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289833.png new file mode 100644 index 0000000..12dbb63 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289833.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289837.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289837.png new file mode 100644 index 0000000..b5f56eb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289837.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289861.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289861.jpg new file mode 100644 index 0000000..1e6acc1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289861.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289865.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289865.png new file mode 100644 index 0000000..482b501 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289865.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289869.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289869.jpg new file mode 100644 index 0000000..6044c06 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289869.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289873.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289873.png new file mode 100644 index 0000000..defaffc Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289873.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289877.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289877.png new file mode 100644 index 0000000..216e652 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289877.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289889.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289889.png new file mode 100644 index 0000000..c8bf6b1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289889.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289901.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289901.jpg new file mode 100644 index 0000000..d3b2d60 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289901.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289909.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289909.jpg new file mode 100644 index 0000000..ab5f657 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289909.jpg differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289921.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289921.png new file mode 100644 index 0000000..df79a39 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289921.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289933.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289933.png new file mode 100644 index 0000000..1f18dec Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289933.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289937.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289937.png new file mode 100644 index 0000000..0fdd6f9 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289937.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289953.jpg b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289953.jpg new file mode 100644 index 0000000..c3bb1e4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289953.jpg differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289997.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289997.png new file mode 100644 index 0000000..5990a01 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349289997.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001349290529.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349290529.png new file mode 100644 index 0000000..3906418 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001349290529.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001387862162.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387862162.png new file mode 100644 index 0000000..6b1c754 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387862162.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001387880686.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387880686.png new file mode 100644 index 0000000..a8d3d34 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387880686.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001387894476.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387894476.png new file mode 100644 index 0000000..0485869 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387894476.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001387912132.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387912132.png new file mode 100644 index 0000000..ddb62bf Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387912132.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001387925652.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387925652.png new file mode 100644 index 0000000..1ab4eb1 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001387925652.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388045030.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388045030.png new file mode 100644 index 0000000..a04446f Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388045030.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388065252.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388065252.png new file mode 100644 index 0000000..dcfcb58 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388065252.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388066504.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388066504.png new file mode 100644 index 0000000..274a51a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388066504.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388071348.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388071348.png new file mode 100644 index 0000000..7666e8a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388071348.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388203690.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388203690.png new file mode 100644 index 0000000..3934d76 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388203690.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388250740.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388250740.png new file mode 100644 index 0000000..d6fb289 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388250740.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388362146.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388362146.png new file mode 100644 index 0000000..a669172 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388362146.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388394084.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388394084.png new file mode 100644 index 0000000..21b15f6 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388394084.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388415558.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388415558.png new file mode 100644 index 0000000..0f6e61e Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388415558.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388527202.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388527202.png new file mode 100644 index 0000000..56978c2 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388527202.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001388575174.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388575174.png new file mode 100644 index 0000000..3e4c50e Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001388575174.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389252974.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389252974.png new file mode 100644 index 0000000..307d524 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389252974.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389336372.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389336372.png new file mode 100644 index 0000000..abf560a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389336372.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389422168.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389422168.png new file mode 100644 index 0000000..0584dae Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389422168.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389429344.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389429344.png new file mode 100644 index 0000000..384d1b4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389429344.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389443044.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389443044.png new file mode 100644 index 0000000..53484ed Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389443044.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001389467018.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389467018.png new file mode 100644 index 0000000..9b4e616 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001389467018.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001437950709.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001437950709.png new file mode 100644 index 0000000..c5b7c36 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001437950709.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438276253.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438276253.png new file mode 100644 index 0000000..9ed2614 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438276253.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438291713.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438291713.png new file mode 100644 index 0000000..ccd3c81 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438291713.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438393405.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438393405.png new file mode 100644 index 0000000..a0e0958 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438393405.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438453365.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438453365.png new file mode 100644 index 0000000..f0beb4c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438453365.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438507709.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438507709.png new file mode 100644 index 0000000..e1ce8e8 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438507709.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438508081.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438508081.png new file mode 100644 index 0000000..8396491 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438508081.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438709421.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438709421.png new file mode 100644 index 0000000..6c185aa Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438709421.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438712537.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438712537.png new file mode 100644 index 0000000..7ae9cd3 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438712537.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438729629.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438729629.png new file mode 100644 index 0000000..5d0632b Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438729629.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438733129.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438733129.png new file mode 100644 index 0000000..1cca7a4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438733129.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438951649.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438951649.png new file mode 100644 index 0000000..8b0df3b Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438951649.png differ diff --git 
a/doc/component-operation-guide/source/_static/images/en-us_image_0000001438962057.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438962057.png new file mode 100644 index 0000000..79bef4c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001438962057.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439150893.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439150893.png new file mode 100644 index 0000000..8b0df3b Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439150893.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439293525.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439293525.png new file mode 100644 index 0000000..d4cd60e Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439293525.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439299573.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439299573.png new file mode 100644 index 0000000..42846fb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439299573.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439380285.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439380285.png new file mode 100644 index 0000000..9a21a94 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439380285.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439498673.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439498673.png new file mode 100644 index 0000000..05658e4 Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439498673.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439698689.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439698689.png new file mode 100644 index 0000000..b1a24ad Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439698689.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439709249.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439709249.png new file mode 100644 index 0000000..13cfeeb Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439709249.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001439763713.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439763713.png new file mode 100644 index 0000000..fd11e3a Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001439763713.png differ diff --git a/doc/component-operation-guide/source/_static/images/en-us_image_0000001441210777.png b/doc/component-operation-guide/source/_static/images/en-us_image_0000001441210777.png new file mode 100644 index 0000000..5977b2c Binary files /dev/null and b/doc/component-operation-guide/source/_static/images/en-us_image_0000001441210777.png differ diff --git 
a/doc/component-operation-guide/source/appendix/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst b/doc/component-operation-guide/source/appendix/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst new file mode 100644 index 0000000..d40205f --- /dev/null +++ b/doc/component-operation-guide/source/appendix/accessing_manager/accessing_fusioninsight_manager_mrs_3.x_or_later.rst @@ -0,0 +1,110 @@ +:original_name: mrs_01_2124.html + +.. _mrs_01_2124: + +Accessing FusionInsight Manager (MRS 3.\ *x* or Later) +====================================================== + +Scenario +-------- + +In MRS 3.\ *x* or later, FusionInsight Manager is used to monitor, configure, and manage clusters. After the cluster is installed, you can use the account to log in to FusionInsight Manager. + +.. note:: + + If you cannot log in to the WebUI of the component, access FusionInsight Manager by referring to :ref:`Accessing FusionInsight Manager from an ECS `. + +Accessing FusionInsight Manager Using EIP +----------------------------------------- + +#. Log in to the MRS management console. + +#. In the navigation pane, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Manager** next to **MRS Manager**. In the displayed dialog box, configure the EIP information. + + a. If no EIP is bound during MRS cluster creation, select an available EIP from the drop-down list on the right of **EIP**. If you have bound an EIP when creating a cluster, go to :ref:`3.b `. + + .. note:: + + If no EIP is available, click **Manage EIP** to create one. Then, select the created EIP from the drop-down list on the right of **EIP**. + + b. .. _mrs_01_2124__en-us_topic_0264431534_li59591846143810: + + Select the security group to which the security group rule to be added belongs. The security group is configured when the cluster is created. + + c. Add a security group rule. By default, the filled-in rule is used to access the EIP. To enable multiple IP address segments to access Manager, see steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + d. Confirm the information and click **OK**. + +#. Click **OK**. The Manager login page is displayed. + +#. Enter the default username **admin** and the password set during cluster creation, and click **Log In**. The Manager page is displayed. + +#. .. _mrs_01_2124__en-us_topic_0264431534_en-us_topic_0035209594_li1049410469610: + + On the MRS management console, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + + .. note:: + + To grant other users the permission to access Manager, perform :ref:`6 ` to :ref:`9 ` to add the users' public IP addresses to the trusted IP address range. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that public network IP/port is a trusted public IP address. I understand that using 0.0.0.0/0 poses security risks**. + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. ..
_mrs_01_2124__en-us_topic_0264431534_en-us_topic_0035209594_li035723593115: + + Click **OK**. + +.. _mrs_01_2124__en-us_topic_0264431534_section20880102283115: + +Accessing FusionInsight Manager from an ECS +------------------------------------------- + +#. On the MRS management console, click **Clusters**. + +#. On the **Active Clusters** page, click the name of the specified cluster. + + Record the **AZ**, **VPC**, **MRS Manager**\ **Security Group** of the cluster. + +#. On the homepage of the management console, choose **Service List** > **Elastic Cloud Server** to switch to the ECS management console and create an ECS. + + - The **AZ**, **VPC**, and **Security Group** of the ECS must be the same as those of the cluster to be accessed. + - Select a Windows public image. For example, a standard image **Windows Server 2012 R2 Standard 64bit(40GB)**. + - For details about other configuration parameters, see **Elastic Cloud Server > User Guide > Getting Started > Creating and Logging In to a Windows ECS**. + + .. note:: + + If the security group of the ECS is different from **Default Security Group** of the Master node, you can modify the configuration using either of the following methods: + + - Change the security group of the ECS to the default security group of the Master node. For details, see **Elastic Cloud Server** > **User Guide** > **Security Group** > **Changing a Security Group**. + - Add two security group rules to the security groups of the Master and Core nodes to enable the ECS to access the cluster. Set **Protocol** to **TCP**, **Ports** of the two security group rules to **28443** and **20009**, respectively. For details, see **Virtual Private Cloud > User Guide > Security > Security Group > Adding a Security Group Rule**. + +#. On the VPC management console, apply for an EIP and bind it to the ECS. + + For details, see **Virtual Private Cloud** > **User Guide** > **Elastic IP** > **Assigning an EIP and Binding It to an ECS**. + +#. Log in to the ECS. + + The Windows system account, password, EIP, and the security group rules are required for logging in to the ECS. For details, see **Elastic Cloud Server > User Guide > Instances > Logging In to a Windows ECS**. + +#. On the Windows remote desktop, use your browser to access Manager. + + For example, you can use Internet Explorer 11 in the Windows 2012 OS. + + The address for accessing Manager is the address of the **MRS Manager** page. Enter the name and password of the cluster user, for example, user **admin**. + + |image1| + + .. note:: + + - If you access Manager with other cluster usernames, change the password upon your first access. The new password must meet the requirements of the current password complexity policies. For details, contact the system administrator. + - By default, a user is locked after inputting an incorrect password five consecutive times. The user is automatically unlocked after 5 minutes. + +#. Log out of FusionInsight Manager. To log out of Manager, move the cursor to |image2| in the upper right corner and click **Log Out**. + +.. |image1| image:: /_static/images/en-us_image_0000001438733129.png +.. 
|image2| image:: /_static/images/en-us_image_0000001438453365.png diff --git a/doc/component-operation-guide/source/appendix/accessing_manager/accessing_mrs_manager_versions_earlier_than_mrs_3.x.rst b/doc/component-operation-guide/source/appendix/accessing_manager/accessing_mrs_manager_versions_earlier_than_mrs_3.x.rst new file mode 100644 index 0000000..0dd2e29 --- /dev/null +++ b/doc/component-operation-guide/source/appendix/accessing_manager/accessing_mrs_manager_versions_earlier_than_mrs_3.x.rst @@ -0,0 +1,96 @@ +:original_name: mrs_01_0102.html + +.. _mrs_01_0102: + +Accessing MRS Manager (Versions Earlier Than MRS 3.x) +===================================================== + +Scenario +-------- + +Clusters of versions earlier than MRS 3.x use MRS Manager to monitor, configure, and manage clusters. You can open the MRS Manager page on the MRS console. + +Accessing MRS manager +--------------------- + +#. Log in to the MRS management console. + +#. In the navigation pane, choose **Clusters** > **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Access Manager**. The **Access MRS Manager** page is displayed. + + - If you have bound an EIP when creating a cluster, + + a. Select the security group to which the security group rule to be added belongs. The security group is configured when the cluster is created. + b. Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. To enable multiple IP address segments to access MRS Manager, see :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from the local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission of port 9022 to access Knox for accessing MRS Manager. + + c. Select the checkbox stating that **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address**. + + - If you have not bound an EIP when creating a cluster, + + a. Select an available EIP from the drop-down list or click **Manage EIP** to create one. + b. Select the security group to which the security group rule to be added belongs. The security group is configured when the cluster is created. + c. Add a security group rule. By default, your public IP address used for accessing port 9022 is filled in the rule. To enable multiple IP address segments to access MRS Manager, see :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + + .. note:: + + - It is normal that the automatically generated public IP address is different from the local IP address and no action is required. + - If port 9022 is a Knox port, you need to enable the permission of port 9022 to access Knox for accessing MRS Manager. + + d. Select the checkbox stating that **I confirm that xx.xx.xx.xx is a trusted public IP address and MRS Manager can be accessed using this IP address**. + +#. Click **OK**. The MRS Manager login page is displayed. + +#. Enter the default username **admin** and the password set during cluster creation, and click **Log In**. The MRS Manager page is displayed. + +#. .. _mrs_01_0102__en-us_topic_0264269234_en-us_topic_0035209594_li1049410469610: + + On the MRS console, click **Clusters** and choose **Active Clusters**. 
Click the target cluster name to access the cluster details page. + + .. note:: + + To assign MRS Manager access permissions to other users, follow instructions from :ref:`6 ` to :ref:`9 ` to add the users' public IP addresses to the trusted range. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that the authorized object is a trusted public IP address range. Do not use 0.0.0.0/0. Otherwise, security risks may arise**. + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`6 ` to :ref:`9 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. .. _mrs_01_0102__en-us_topic_0264269234_en-us_topic_0035209594_li035723593115: + + Click **OK**. + +If the cluster version is **MRS 1.7.2** or earlier and Kerberos authentication is not enabled for the cluster, perform the following operations: + +#. Log in to the MRS management console. + +#. In the navigation pane, click **Clusters** and choose **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Access MRS Manager**. + + After logging in to the MRS management console, you can access MRS Manager. By default, user **admin** is used for login. You do not need to enter the password again. + +If the cluster version is **MRS 1.7.2** or earlier and Kerberos authentication is enabled for the cluster, see **Accessing Web Pages of Open Source Components Managed in MRS Clusters** > **Access Using a Windows ECS** in the *MapReduce Service User Guide*. + +Granting the Permission to Access MRS Manager to Other Users +------------------------------------------------------------ + +#. .. _mrs_01_0102__en-us_topic_0264269234_li1750491811399: + + On the MRS console, click **Clusters** and choose **Active Clusters**. Click the target cluster name to access the cluster details page. + +#. Click **Add Security Group Rule** on the right of **EIP**. + +#. On the **Add Security Group Rule** page, add the IP address segment for users to access the public network and select **I confirm that the authorized object is a trusted public IP address range. Do not use 0.0.0.0/0. Otherwise, security risks may arise.** + + By default, the IP address used for accessing the public network is filled. You can change the IP address segment as required. To enable multiple IP address segments, repeat steps :ref:`1 ` to :ref:`4 `. If you want to view, modify, or delete a security group rule, click **Manage Security Group Rule**. + +#. .. _mrs_01_0102__en-us_topic_0264269234_li55051218183912: + + Click **OK**. diff --git a/doc/component-operation-guide/source/appendix/accessing_manager/index.rst b/doc/component-operation-guide/source/appendix/accessing_manager/index.rst new file mode 100644 index 0000000..0edc59e --- /dev/null +++ b/doc/component-operation-guide/source/appendix/accessing_manager/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_2123.html + +.. _mrs_01_2123: + +Accessing Manager +================= + +- :ref:`Accessing MRS Manager (Versions Earlier Than MRS 3.x) ` +- :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) ` + +..
toctree:: + :maxdepth: 1 + :hidden: + + accessing_mrs_manager_versions_earlier_than_mrs_3.x + accessing_fusioninsight_manager_mrs_3.x_or_later diff --git a/doc/component-operation-guide/source/appendix/index.rst b/doc/component-operation-guide/source/appendix/index.rst new file mode 100644 index 0000000..3e4129b --- /dev/null +++ b/doc/component-operation-guide/source/appendix/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_2122.html + +.. _mrs_01_2122: + +Appendix +======== + +- :ref:`Modifying Cluster Service Configuration Parameters ` +- :ref:`Accessing Manager ` +- :ref:`Using an MRS Client ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + modifying_cluster_service_configuration_parameters + accessing_manager/index + using_an_mrs_client/index diff --git a/doc/component-operation-guide/source/appendix/modifying_cluster_service_configuration_parameters.rst b/doc/component-operation-guide/source/appendix/modifying_cluster_service_configuration_parameters.rst new file mode 100644 index 0000000..f38577c --- /dev/null +++ b/doc/component-operation-guide/source/appendix/modifying_cluster_service_configuration_parameters.rst @@ -0,0 +1,74 @@ +:original_name: mrs_01_2125.html + +.. _mrs_01_2125: + +Modifying Cluster Service Configuration Parameters +================================================== + +- For MRS 1.9.2 or later: You can modify service configuration parameters on the cluster management page of the MRS management console. + + #. Log in to the MRS console. In the left navigation pane, choose **Clusters** > **Active Clusters**, and click a cluster name. + + #. Choose **Components** > *Name of the desired service* > **Service Configuration**. + + The **Basic Configuration** tab page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + + #. In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in the search box in the upper right corner. The system searches for the parameter in real time and displays the result. + + #. Click **Save Configuration**. In the displayed dialog box, click **OK**. + + #. Wait until the message **Operation successful** is displayed. Click **Finish**. + + The configuration is modified. + + Check whether there is any service whose configuration has expired in the cluster. If yes, restart the corresponding service or role instance for the configuration to take effect. You can also select **Restart the affected services or instances** when saving the configuration. + +- For versions earlier than MRS 3.\ *x*: You can log in to MRS Manager to modify service configuration parameters. + + #. Log in to MRS Manager. + + #. Click **Services**. + + #. Click the specified service name on the service management page. + + #. Click **Service Configuration**. + + The **Basic Configuration** tab page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + + #.
In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in search box in the upper right corner. The system searches for the parameter in real time and displays the result. + + #. Click **Save**. In the confirmation dialog box, click **OK**. + + #. Wait until the message **Operation successful** is displayed. Click **Finish**. + + The configuration is modified. + + Check whether there is any service whose configuration has expired in the cluster. If yes, restart the corresponding service or role instance for the configuration to take effect. You can also select **Restart the affected services or instances** when saving the configuration. + +- For MRS 3.\ *x* or later: You can log in to FusionInsight Manager to modify service configuration parameters. + + #. You have logged in to FusionInsight Manager. + + #. Choose **Cluster** > **Service**. + + #. Click the specified service name on the service management page. + + #. Click **Configuration**. + + The **Basic Configuration** tab page is displayed by default. To modify more parameters, click the **All Configurations** tab. The navigation tree displays all configuration parameters of the service. The level-1 nodes in the navigation tree are service names or role names. The parameter category is displayed after the level-1 node is expanded. + + #. In the navigation tree, select the specified parameter category and change the parameter values on the right. + + If you are not sure about the location of a parameter, you can enter the parameter name in search box in the upper right corner. The system searches for the parameter in real time and displays the result. + + #. Click **Save**. In the confirmation dialog box, click **OK**. + + #. Wait until the message **Operation successful** is displayed. Click **Finish**. + + The configuration is modified. + + Check whether there is any service whose configuration has expired in the cluster. If yes, restart the corresponding service or role instance for the configuration to take effect. diff --git a/doc/component-operation-guide/source/appendix/using_an_mrs_client/index.rst b/doc/component-operation-guide/source/appendix/using_an_mrs_client/index.rst new file mode 100644 index 0000000..47421f4 --- /dev/null +++ b/doc/component-operation-guide/source/appendix/using_an_mrs_client/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_2126.html + +.. _mrs_01_2126: + +Using an MRS Client +=================== + +- :ref:`Installing a Client (Version 3.x or Later) ` +- :ref:`Installing a Client (Versions Earlier Than 3.x) ` +- :ref:`Updating a Client (Version 3.x or Later) ` +- :ref:`Updating a Client (Versions Earlier Than 3.x) ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + installing_a_client_version_3.x_or_later + installing_a_client_versions_earlier_than_3.x + updating_a_client_version_3.x_or_later + updating_a_client_versions_earlier_than_3.x diff --git a/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_version_3.x_or_later.rst b/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_version_3.x_or_later.rst new file mode 100644 index 0000000..0a2c88a --- /dev/null +++ b/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_version_3.x_or_later.rst @@ -0,0 +1,231 @@ +:original_name: mrs_01_2127.html + +.. 
_mrs_01_2127: + +Installing a Client (Version 3.x or Later) +========================================== + +Scenario +-------- + +This section describes how to install clients of all services (excluding Flume) in an MRS cluster. For details about how to install the Flume client, see `Installing the Flume Client `__. + +A client can be installed on a node inside or outside the cluster. This section uses the installation directory **//opt/client** as an example. Replace it with the actual one. + +.. _mrs_01_2127__en-us_topic_0264269828_en-us_topic_0270713152_en-us_topic_0264269418_section3219221104310: + +Prerequisites +------------- + +- A Linux ECS has been prepared. For details about the supported OS of the ECS, see :ref:`Table 1 `. + + .. _mrs_01_2127__en-us_topic_0264269828_en-us_topic_0270713152_en-us_topic_0264269418_table40818788104630: + + .. table:: **Table 1** Reference list + + +-------------------------+--------+-------------------------------------------------+ + | CPU Architecture | OS | Supported Version | + +=========================+========+=================================================+ + | x86 computing | Euler | EulerOS 2.5 | + +-------------------------+--------+-------------------------------------------------+ + | | SUSE | SUSE Linux Enterprise Server 12 SP4 (SUSE 12.4) | + +-------------------------+--------+-------------------------------------------------+ + | | RedHat | Red Hat-7.5-x86_64 (Red Hat 7.5) | + +-------------------------+--------+-------------------------------------------------+ + | | CentOS | CentOS 7.6 | + +-------------------------+--------+-------------------------------------------------+ + | Kunpeng computing (Arm) | Euler | EulerOS 2.8 | + +-------------------------+--------+-------------------------------------------------+ + | | CentOS | CentOS 7.6 | + +-------------------------+--------+-------------------------------------------------+ + + In addition, sufficient disk space is allocated for the ECS, for example, 40 GB. + +- The ECS and the MRS cluster are in the same VPC. + +- The security group of the ECS must be the same as that of the master node in the MRS cluster. + +- The NTP service has been installed on the ECS OS and is running properly. + + If the NTP service is not installed, run the **yum install ntp -y** command to install it when the **yum** source is configured. + +- A user can log in to the Linux ECS using the password (in SSH mode). + +.. _mrs_01_2127__en-us_topic_0264269828_section181806577218: + +Installing a Client on a Node Inside a Cluster +---------------------------------------------- + +#. Obtain the software package. + + Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Click the name of the cluster to be operated in the **Cluster** drop-down list. + + Choose **More > Download Client**. The **Download Cluster Client** dialog box is displayed. + + .. note:: + + In the scenario where only one client is to be installed, choose **Cluster >** **Service >** *Service name* **> More > Download Client**. The **Download Client** dialog box is displayed. + +#. Set the client type to **Complete Client**. + + **Configuration Files Only** is to download client configuration files in the following scenario: After a complete client is downloaded and installed and administrators modify server configurations on Manager, developers need to update the configuration files during application development. + + The platform type can be set to **x86_64** or **aarch64**. 
+ + - **x86_64**: indicates the client software package that can be deployed on the x86 servers. + - **aarch64**: indicates the client software package that can be deployed on the TaiShan servers. + + .. note:: + + The cluster supports two types of clients: **x86_64** and **aarch64**. The client type must match the architecture of the node for installing the client. Otherwise, client installation will fail. + +#. Select **Save to Path** and click **OK** to generate the client file. + + The generated file is stored in the **/tmp/FusionInsight-Client** directory on the active management node by default. You can also store the client file in a directory on which user **omm** has the read, write, and execute permissions. Copy the software package to the file directory on the server where the client is to be installed as user **omm** or **root**. + + The name of the client software package is in the following format: **FusionInsight_Cluster\_\ <**\ *Cluster ID*\ **>\ \_Services_Client.tar**. In this section, the cluster ID **1** is used as an example. Replace it with the actual cluster ID. + + The following steps and sections use **FusionInsight_Cluster_1_Services_Client.tar** as an example. + + .. note:: + + If you cannot obtain the permissions of user **root**, use user **omm**. + + To install the client on another node in the cluster, run the following command to copy the client to the node where the client is to be installed: + + **scp -p /**\ *tmp/FusionInsight-Client*\ **/FusionInsight_Cluster_1_Services_Client.tar** *IP address of the node where the client is to be installed:/opt/Bigdata/client* + +#. Log in to the server where the client software package is located as user **user_client**. + +#. Decompress the software package. + + Go to the directory where the installation package is stored, such as **/tmp/FusionInsight-Client**. Run the following command to decompress the installation package to a local directory: + + **tar -xvf** **FusionInsight_Cluster_1_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file. + + **sha256sum -c** **FusionInsight_Cluster_1_Services_ClientConfig.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig.tar: OK + +#. Decompress the obtained installation file. + + **tar -xvf** **FusionInsight_Cluster_1_Services_ClientConfig.tar** + +#. Go to the directory where the installation package is stored, and run the following command to install the client to a specified directory (an absolute path), for example, **/opt/client**: + + **cd /tmp/FusionInsight-Client/FusionInsight\_Cluster_1_Services_ClientConfig** + + Run the **./install.sh /opt/client** command to install the client. The client is successfully installed if information similar to the following is displayed: + + .. code-block:: + + The component client is installed successfully + + .. note:: + + - If the clients of all or some services use the **/opt/client** directory, other directories must be used when you install other service clients. + - You must delete the client installation directory when uninstalling a client. + - To ensure that an installed client can only be used by the installation user (for example, **user_client**), add parameter **-o** during the installation. That is, run the **./install.sh /opt/client -o** command to install the client. + - If an HBase client is installed, it is recommended that the client installation directory contain only uppercase and lowercase letters, digits, and characters ``(_-?.@+=)`` due to the limitation of the Ruby syntax used by HBase.
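+
+As a quick reference, the download, verification, and installation steps above can be combined into a single shell session. The following is a minimal sketch rather than an official script; it assumes that the package for cluster ID **1** has already been saved to **/tmp/FusionInsight-Client**, that **/opt/client** is the target installation directory, and that the current user has the required permissions on both paths:
+
+.. code-block::
+
+   # Go to the directory that holds the downloaded client package (assumed location).
+   cd /tmp/FusionInsight-Client
+   # Unpack the client package for cluster ID 1.
+   tar -xvf FusionInsight_Cluster_1_Services_Client.tar
+   # Verify the configuration package; the expected output ends with ": OK".
+   sha256sum -c FusionInsight_Cluster_1_Services_ClientConfig.tar.sha256
+   # Unpack the configuration package and install the client to /opt/client.
+   tar -xvf FusionInsight_Cluster_1_Services_ClientConfig.tar
+   cd FusionInsight_Cluster_1_Services_ClientConfig
+   ./install.sh /opt/client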
+ +Using a Client +-------------- + +#. On the node where the client is installed, run the **sudo su - omm** command to switch the user. Run the following command to go to the client directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + .. note:: + + User **admin** is created by default for MRS clusters with Kerberos authentication enabled and is used for administrators to maintain the clusters. + +#. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. + +Installing a Client on a Node Outside a Cluster +----------------------------------------------- + +#. Create an ECS that meets the requirements in :ref:`Prerequisites `. +#. Perform NTP time synchronization to synchronize the time of nodes outside the cluster with that of the MRS cluster. + + a. Run the **vi /etc/ntp.conf** command to edit the NTP client configuration file, add the IP addresses of the Master nodes in the MRS cluster, and comment out the IP addresses of other servers. + + .. code-block:: + + server master1_ip prefer + server master2_ip + + + .. figure:: /_static/images/en-us_image_0000001438729629.png + :alt: **Figure 1** Adding the Master node IP addresses + + **Figure 1** Adding the Master node IP addresses + + b. Run the **service ntpd stop** command to stop the NTP service. + + c. Run the following command to manually synchronize the time: + + **/usr/sbin/ntpdate** *192.168.10.8* + + .. note:: + + **192.168.10.8** indicates the IP address of the active Master node. + + d. Run the **service ntpd start** or **systemctl restart ntpd** command to start the NTP service. + + e. Run the **ntpstat** command to check the time synchronization result. + +#. Perform the following steps to download the cluster client software package from FusionInsight Manager, copy the package to the ECS node, and install the client: + + a. Log in to FusionInsight Manager and download the cluster client to the specified directory on the active management node by referring to :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) ` and :ref:`Installing a Client on a Node Inside a Cluster `. + + b. Log in to the active management node as user **root** and run the following command to copy the client installation package to the target node: + + **scp -p /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_Client.tar** *IP address of the node where the client is to be installed*\ **:/tmp** + + c. Log in to the node on which the client is to be installed as the client user. + + Run the following commands to install the client. If the user does not have operation permissions on the client software package and client installation directory, grant the permissions using the **root** user.
+ + **cd /tmp** + + **tar -xvf** **FusionInsight_Cluster_1_Services_Client.tar** + + **tar -xvf** **FusionInsight_Cluster_1_Services_ClientConfig.tar** + + **cd FusionInsight\_Cluster_1_Services_ClientConfig** + + **./install.sh /opt/client** + + d. Run the following commands to switch to the client directory and configure environment variables: + + **cd /opt/client** + + **source bigdata_env** + + e. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + f. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. diff --git a/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_versions_earlier_than_3.x.rst b/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_versions_earlier_than_3.x.rst new file mode 100644 index 0000000..04d06be --- /dev/null +++ b/doc/component-operation-guide/source/appendix/using_an_mrs_client/installing_a_client_versions_earlier_than_3.x.rst @@ -0,0 +1,260 @@ +:original_name: mrs_01_2128.html + +.. _mrs_01_2128: + +Installing a Client (Versions Earlier Than 3.x) +=============================================== + +Scenario +-------- + +An MRS client is required. The MRS cluster client can be installed on the Master or Core node in the cluster or on a node outside the cluster. + +After a cluster of versions earlier than MRS 3.x is created, a client is installed on the active Master node by default. You can directly use the client. The installation directory is **/opt/client**. + +For details about how to install a client of MRS 3.x or later, see :ref:`Installing a Client (Version 3.x or Later) `. + +.. note:: + + If a client has been installed on the node outside the MRS cluster and the client only needs to be updated, update the client using the user who installed the client, for example, user **root**. + +Prerequisites +------------- + +- An ECS has been prepared. For details about the OS and its version of the ECS, see :ref:`Table 1 `. + + .. _mrs_01_2128__en-us_topic_0264269418_table40818788104630: + + .. table:: **Table 1** Reference list + + +-----------------------------------+-----------------------------------+ + | OS | Supported Version | + +===================================+===================================+ + | EulerOS | - Available: EulerOS 2.2 | + | | - Available: EulerOS 2.3 | + | | - Available: EulerOS 2.5 | + +-----------------------------------+-----------------------------------+ + + For example, a user can select the enterprise image **Enterprise_SLES11_SP4_latest(4GB)** or standard image **Standard_CentOS_7.2_latest(4GB)** to prepare the OS for an ECS. + + In addition, sufficient disk space is allocated for the ECS, for example, 40 GB. + +- The ECS and the MRS cluster are in the same VPC. + +- The security group of the ECS is the same as that of the Master node of the MRS cluster. + + If this requirement is not met, modify the ECS security group or configure the inbound and outbound rules of the ECS security group to allow the ECS security group to be accessed by all security groups of MRS cluster nodes. 
+ +- To enable users to log in to a Linux ECS using a password (SSH), see **Instances** *>* **Logging In to a Linux ECS** *>* **Login Using an SSH Password** *in the Elastic Cloud Server User Guide*. + +Installing a Client on the Core Node +------------------------------------ + +#. Log in to MRS Manager and choose **Services** > **Download Client** to download the client installation package to the active management node. + + .. note:: + + If only the client configuration file needs to be updated, see method 2 in :ref:`Updating a Client (Versions Earlier Than 3.x) `. + +#. Use the IP address to search for the active management node, and log in to the active management node using VNC. + +#. Log in to the active management node, and run the following command to switch the user: + + **sudo su - omm** + +#. On the MRS management console, view the IP address on the **Nodes** tab page of the specified cluster. + + Record the IP address of the Core node where the client is to be used. + +#. On the active management node, run the following command to copy the client installation package to the Core node: + + **scp -p /tmp/MRS-client/MRS\_Services_Client.tar** *IP address of the Core node*\ **:/opt/client** + +#. Log in to the Core node as user **root**. + + For details, see `Login Using an SSH Key `__. + +#. Run the following commands to install the client: + + **cd /opt/client** + + **tar -xvf** **MRS\_Services_Client.tar** + + **tar -xvf MRS\ \_\ Services_ClientConfig.tar** + + **cd /opt/client/MRS\_Services_ClientConfig** + + **./install.sh** *Client installation directory* + + For example, run the following command: + + **./install.sh /opt/client** + +#. For details about how to use the client, see :ref:`Using an MRS Client `. + +.. _mrs_01_2128__en-us_topic_0264269418_section8796733802: + +Using an MRS Client +------------------- + +#. On the node where the client is installed, run the **sudo su - omm** command to switch the user. Run the following command to go to the client directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + + .. note:: + + User **admin** is created by default for MRS clusters with Kerberos authentication enabled and is used for administrators to maintain the clusters. + +#. Run the client command of a component directly. + + For example, run the **hdfs dfs -ls /** command to view files in the HDFS root directory. + +Installing a Client on a Node Outside the Cluster +------------------------------------------------- + +#. Create an ECS that meets the requirements in the prerequisites. + +#. .. _mrs_01_2128__en-us_topic_0264269418_li1148114223118: + + Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager (Versions Earlier Than MRS 3.x) `. Then, choose **Services**. + +#. Click **Download Client**. + +#. In **Client Type**, select **All client files**. + +#. In **Download To**, select **Remote host**. + +#. .. _mrs_01_2128__en-us_topic_0264269418_li24260068101924: + + Set **Host IP Address** to the IP address of the ECS, **Host Port** to **22**, and **Save Path** to **/home/linux**. + + - If the default port **22** for logging in to an ECS using SSH has been changed, set **Host Port** to the new port. 
+ - **Save Path** contains a maximum of 256 characters. + +#. Set **Login User** to **root**. + + If other users are used, ensure that the users have read, write, and execute permission on the save path. + +#. In **SSH Private Key**, select and upload the key file used for creating cluster B. + +#. Click **OK** to generate a client file. + + If the following information is displayed, the client package is saved. Click **Close**. Obtain the client file from the save path on the remote host that is set when the client is downloaded. + + .. code-block:: text + + Client files downloaded to the remote host successfully. + + If the following information is displayed, check the username, password, and security group configurations of the remote host. Ensure that the username and password are correct and an inbound rule of the SSH (22) port has been added to the security group of the remote host. And then, go to :ref:`2 ` to download the client again. + + .. code-block:: text + + Failed to connect to the server. Please check the network connection or parameter settings. + + .. note:: + + Generating a client will occupy a large number of disk I/Os. You are advised not to download a client when the cluster is being installed, started, and patched, or in other unstable states. + +#. Log in to the ECS using VNC. For details, see **Instance** > **Logging In to a Linux** > **Logging In to a Linux** in the *Elastic Cloud Server* *User Guide* + + Log in to the ECS. For details, see `Login Using an SSH Key `__. Set the ECS password and log in to the ECS in VNC mode. + +#. Perform NTP time synchronization to synchronize the time of nodes outside the cluster with the time of the MRS cluster. + + a. Check whether the NTP service is installed. If it is not installed, run the **yum install ntp -y** command to install it. + + b. Run the **vim /etc/ntp.conf** command to edit the NTP client configuration file, add the IP address of the Master node in the MRS cluster, and comment out the IP addresses of other servers. + + .. code-block:: + + server master1_ip prefer + server master2_ip + + + .. figure:: /_static/images/en-us_image_0000001388250740.png + :alt: **Figure 1** Adding the Master node IP addresses + + **Figure 1** Adding the Master node IP addresses + + c. Run the **service ntpd stop** command to stop the NTP service. + + d. Run the following command to manually synchronize the time: + + **/usr/sbin/ntpdate** *192.168.10.8* + + .. note:: + + **192.168.10.8** indicates the IP address of the active Master node. + + e. Run the **service ntpd start** or **systemctl restart ntpd** command to start the NTP service. + + f. Run the **ntpstat** command to check the time synchronization result: + +#. On the ECS, switch to user **root** and copy the installation package in **Save Path** in :ref:`6 ` to the **/opt** directory. For example, if **Save Path** is set to **/home/linux**, run the following commands: + + **sudo su - root** + + **cp /home/linux/MRS_Services_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS\_Services_Client.tar** + +#. Run the following command to verify the configuration file package of the client: + + **sha256sum -c MRS\_Services_ClientConfig.tar.sha256** + + The command output is as follows: + + .. code-block:: + + MRS_Services_ClientConfig.tar: OK + +#. 
Run the following command to decompress **MRS_Services_ClientConfig.tar**: + + **tar -xvf MRS\_Services_ClientConfig.tar** + +#. Run the following command to install the client to a new directory, for example, **/opt/Bigdata/client**. A directory is automatically generated during the client installation. + + **sh /opt/MRS\_Services_ClientConfig/install.sh /opt/Bigdata/client** + + If the following information is displayed, the client has been successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Check whether the IP address of the ECS node is connected to the IP address of the cluster Master node. + + For example, run the following command: **ping** *Master node IP address*. + + - If yes, go to :ref:`18 `. + - If no, check whether the VPC and security group are correct and whether the ECS and the MRS cluster are in the same VPC and security group, and go to :ref:`18 `. + +#. .. _mrs_01_2128__en-us_topic_0264269418_li6406429718107: + + Run the following command to configure environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Run the client command of a component. + + For example, run the following command to query the HDFS directory: + + **hdfs dfs -ls /** diff --git a/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_version_3.x_or_later.rst b/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_version_3.x_or_later.rst new file mode 100644 index 0000000..4dbe24a --- /dev/null +++ b/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_version_3.x_or_later.rst @@ -0,0 +1,84 @@ +:original_name: mrs_01_2129.html + +.. _mrs_01_2129: + +Updating a Client (Version 3.x or Later) +======================================== + +A cluster provides a client for you to connect to a server, view task results, or manage data. If you modify service configuration parameters on Manager and restart the service, you need to download and install the client again or use the configuration file to update the client. + +Updating the Client Configuration +--------------------------------- + +**Method 1**: + +#. Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Click the name of the cluster to be operated in the **Cluster** drop-down list. + +#. Choose **More** > **Download Client** > **Configuration Files Only**. + + The generated compressed file contains the configuration files of all services. + +#. Determine whether to generate a configuration file on the cluster node. + + - If yes, select **Save to Path**, and click **OK** to generate the client file. By default, the client file is generated in **/tmp/FusionInsight-Client** on the active management node. You can also store the client file in other directories, and user **omm** has the read, write, and execute permissions on the directories. Then go to :ref:`4 `. + - If no, click **OK**, specify a local save path, and download the complete client. Wait until the download is complete and go to :ref:`4 `. + +#. .. 
_mrs_01_2129__en-us_topic_0000001142158626_en-us_topic_0263899336_en-us_topic_0193213946_l6af983f03121493ca3526296f5b650c3: + + Use WinSCP to save the compressed file to the client installation directory, for example, **/opt/hadoopclient**, as the client installation user. + +#. Decompress the software package. + + Run the following commands to go to the directory where the client is installed, and decompress the file to a local directory. For example, the downloaded client file is **FusionInsight_Cluster_1_Services_Client.tar**. + + **cd /opt/hadoopclient** + + **tar -xvf** **FusionInsight_Cluster_1\_Services_Client.tar** + +#. Verify the software package. + + Run the following command to verify the decompressed file and check whether the command output is consistent with the information in the **sha256** file. + + **sha256sum -c** **FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Services_ClientConfig_ConfigFiles.tar: OK + +#. Decompress the package to obtain the configuration file. + + **tar -xvf FusionInsight\_\ Cluster_1\_\ Services_ClientConfig_ConfigFiles.tar** + +#. Run the following command in the client installation directory to update the client using the configuration file: + + **sh refreshConfig.sh** *Client installation directory* *Directory where the configuration file is located* + + For example, run the following command: + + **sh refreshConfig.sh** **/opt/hadoopclient /opt/hadoop\ client/FusionInsight\_Cluster_1_Services_ClientConfig\_ConfigFiles** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. + +**Method 2**: + +#. Log in to the client installation node as user **root**. + +#. Go to the client installation directory, for example, **/opt/hadoopclient** and run the following commands to update the configuration file: + + **cd /opt/hadoopclient** + + **sh autoRefreshConfig.sh** + +#. Enter the username and password of the FusionInsight Manager administrator and the floating IP address of FusionInsight Manager. + +#. Enter the names of the components whose configuration needs to be updated. Use commas (,) to separate the component names. Press **Enter** to update the configurations of all components if necessary. + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. diff --git a/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_versions_earlier_than_3.x.rst b/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_versions_earlier_than_3.x.rst new file mode 100644 index 0000000..8f28c25 --- /dev/null +++ b/doc/component-operation-guide/source/appendix/using_an_mrs_client/updating_a_client_versions_earlier_than_3.x.rst @@ -0,0 +1,197 @@ +:original_name: mrs_01_2130.html + +.. _mrs_01_2130: + +Updating a Client (Versions Earlier Than 3.x) +============================================= + +.. note:: + + This section applies to clusters of versions earlier than MRS 3.x. For MRS 3.x or later, see :ref:`Updating a Client (Version 3.x or Later) `. + +Updating a Client Configuration File +------------------------------------ + +**Scenario** + +An MRS cluster provides a client for you to connect to a server, view task results, or manage data. 
Before using an MRS client, you need to download and update the client configuration file if service configuration parameters are modified and a service is restarted or the service is merely restarted on MRS Manager. + +During cluster creation, the original client is stored in the **/opt/client** directory on all nodes in the cluster by default. After the cluster is created, only the client of a Master node can be directly used. To use the client of a Core node, you need to update the client configuration file first. + +**Procedure** + +**Method 1:** + +#. Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager (Versions Earlier Than MRS 3.x) `. Then, choose **Services**. + +#. Click **Download Client**. + + Set **Client Type** to **Only configuration files**, **Download To** to **Server**, and click **OK** to generate the client configuration file. The generated file is saved in the **/tmp/MRS-client** directory on the active management node by default. You can customize the file path. + +#. Query and log in to the active Master node. + +#. If you use the client in the cluster, run the following command to switch to user **omm**. If you use the client outside the cluster, switch to user **root**. + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command to update client configurations: + + **sh refreshConfig.sh** *Client installation directory* *Full path of the client configuration file package* + + For example, run the following command: + + **sh refreshConfig.sh /opt/Bigdata/client /tmp/MRS-client/MRS_Services_Client.tar** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + ReFresh components client config is complete. + Succeed to refresh components client config. + +**Method 2: applicable to MRS 1.9.2 or later** + +#. After the cluster is installed, run the following command to switch to user **omm**. If you use the client outside the cluster, switch to user **root**. + + **sudo su - omm** + +#. Run the following command to switch to the client directory, for example, **/opt/Bigdata/client**: + + **cd /opt/Bigdata/client** + +#. Run the following command and enter the name of an MRS Manager user with the download permission and its password (for example, the username is **admin** and the password is the one set during cluster creation) as prompted to update client configurations. + + **sh autoRefreshConfig.sh** + +#. After the command is executed, the following information is displayed, where *XXX* indicates the name of the component installed in the cluster. To update client configurations of all components, press **Enter**. To update client configurations of some components, enter the component names and separate them with commas (,). + + .. code-block:: + + Components "xxx" have been installed in the cluster. Please input the comma-separated names of the components for which you want to update client configurations. If you press Enter without inputting any component name, the client configurations of all components will be updated: + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + Succeed to refresh components client config. + + If the following information is displayed, the username or password is incorrect. + + .. code-block:: + + login manager failed,Incorrect username or password. + + .. 
note:: + + - This script automatically connects to the cluster and invokes the **refreshConfig.sh** script to download and update the client configuration file. + - By default, the client uses the floating IP address specified by **wsom=xxx** in the **Version** file in the installation directory to update the client configurations. To update the configuration file of another cluster, modify the value of **wsom=xxx** in the **Version** file to the floating IP address of the corresponding cluster before performing this step. + +Fully Updating the Original Client of the Active Master Node +------------------------------------------------------------ + +**Scenario** + +During cluster creation, the original client is stored in the **/opt/client** directory on all nodes in the cluster by default. The following uses **/opt/Bigdata/client** as an example. + +- For a normal MRS cluster, you will use the pre-installed client on a Master node to submit a job on the management console page. +- You can also use the pre-installed client on the Master node to connect to a server, view task results, and manage data. + +After installing the patch on the cluster, you need to update the client on the Master node to ensure that the functions of the built-in client are available. + +**Procedure** + +#. .. _mrs_01_2130__en-us_topic_0264269034_li6500547131416: + + Log in to MRS Manager. For details, see :ref:`Accessing MRS Manager (Versions Earlier Than MRS 3.x) `. Then, choose **Services**. + +#. Click **Download Client**. + + Set **Client Type** to **All client files**, **Download To** to **Server**, and click **OK** to generate the client configuration file. The generated file is saved in the **/tmp/MRS-client** directory on the active management node by default. You can customize the file path. + +#. .. _mrs_01_2130__en-us_topic_0264269034_li14850170195112: + + Query and log in to the active Master node. + +#. .. _mrs_01_2130__en-us_topic_0264269034_li3635762195625: + + On the ECS, switch to user **root** and copy the installation package to the **/opt** directory. + + **sudo su - root** + + **cp /tmp/MRS-client/MRS_Services_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS\_Services_Client.tar** + +#. Run the following command to verify the configuration file package of the client: + + **sha256sum -c MRS\_Services_ClientConfig.tar.sha256** + + The command output is as follows: + + .. code-block:: + + MRS_Services_ClientConfig.tar: OK + +#. Run the following command to decompress **MRS_Services_ClientConfig.tar**: + + **tar -xvf MRS\_Services_ClientConfig.tar** + +#. Run the following command to move the original client to the **/opt/Bigdata/client_bak** directory: + + **mv /opt/Bigdata/client** **/opt/Bigdata/client_bak** + +#. Run the following command to install the client in a new directory. The client path must be **/opt/Bigdata/client**. + + **sh /opt/MRS\_Services_ClientConfig/install.sh /opt/Bigdata/client** + + If the following information is displayed, the client has been successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Run the following command to modify the user and user group of the **/opt/Bigdata/client** directory: + + **chown omm:wheel /opt/Bigdata/client -R** + +#. Run the following command to configure environment variables: + + **source /opt/Bigdata/client/bigdata_env** + +#. 
If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. .. _mrs_01_2130__en-us_topic_0264269034_li6221236418107: + + Run the client command of a component. + + For example, run the following command to query the HDFS directory: + + **hdfs dfs -ls /** + +Fully Updating the Original Client of the Standby Master Node +------------------------------------------------------------- + +#. Repeat :ref:`1 ` to :ref:`3 ` to log in to the standby Master node, and run the following command to switch to user **omm**: + + **sudo su - omm** + +#. Run the following command on the standby master node to copy the downloaded client package from the active master node: + + **scp omm@**\ *master1 nodeIP address*\ **:/tmp/MRS-client/MRS_Services_Client.tar /tmp/MRS-client/** + + .. note:: + + - In this command, **master1** node is the active master node. + - **/tmp/MRS-client/** is an example target directory of the standby master node. + +#. Repeat :ref:`4 ` to :ref:`13 ` to update the client of the standby Master node. diff --git a/doc/component-operation-guide/source/change_history.rst b/doc/component-operation-guide/source/change_history.rst new file mode 100644 index 0000000..bfb634d --- /dev/null +++ b/doc/component-operation-guide/source/change_history.rst @@ -0,0 +1,23 @@ +:original_name: en-us_topic_0000001351362309.html + +.. _en-us_topic_0000001351362309: + +Change History +============== + ++-----------------------------------+-------------------------------------------------------------------------------------------------+ +| Released On | What's New | ++===================================+=================================================================================================+ +| 2022-11\ ``-``\ 01 | Modified the following content: | +| | | +| | Updated the screenshots in the operation guides for ClickHouse, Ranger, Spark2x, Tez, and Yarn. | ++-----------------------------------+-------------------------------------------------------------------------------------------------+ +| 2022-09-29 | Added the ClickHouse component. For details, see :ref:`Using ClickHouse `. | ++-----------------------------------+-------------------------------------------------------------------------------------------------+ +| 2021-09-20 | Added the Hudi component. For details, see :ref:`Using Hudi `. | ++-----------------------------------+-------------------------------------------------------------------------------------------------+ +| 2020-03-18 | - Added the Alluxio component. For details, see :ref:`Using Alluxio `. | +| | - Added the Ranger component. For details, see :ref:`Using Ranger (MRS 1.9.2) `. | ++-----------------------------------+-------------------------------------------------------------------------------------------------+ +| 2017-02-20 | This issue is the first official release. | ++-----------------------------------+-------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/index.rst b/doc/component-operation-guide/source/index.rst index 02742bc..6077114 100644 --- a/doc/component-operation-guide/source/index.rst +++ b/doc/component-operation-guide/source/index.rst @@ -2,3 +2,37 @@ Map Reduce Service - Component Operation Guide ============================================== +.. 
toctree:: + :maxdepth: 1 + + using_alluxio/index + using_carbondata_for_versions_earlier_than_mrs_3.x/index + using_carbondata_for_mrs_3.x_or_later/index + using_clickhouse/index + using_dbservice/index + using_flink/index + using_flume/index + using_hbase/index + using_hdfs/index + using_hive/index + using_hudi/index + using_hue_versions_earlier_than_mrs_3.x/index + using_hue_mrs_3.x_or_later/index + using_kafka/index + using_kafkamanager/index + using_loader/index + using_mapreduce/index + using_oozie/index + using_opentsdb/index + using_presto/index + using_ranger_mrs_1.9.2/index + using_ranger_mrs_3.x/index + using_spark/index + using_spark2x/index + using_sqoop/index + using_storm/index + using_tez/index + using_yarn/index + using_zookeeper/index + appendix/index + change_history diff --git a/doc/component-operation-guide/source/using_alluxio/accessing_alluxio_using_a_data_application.rst b/doc/component-operation-guide/source/using_alluxio/accessing_alluxio_using_a_data_application.rst new file mode 100644 index 0000000..a6875ca --- /dev/null +++ b/doc/component-operation-guide/source/using_alluxio/accessing_alluxio_using_a_data_application.rst @@ -0,0 +1,163 @@ +:original_name: mrs_01_0760.html + +.. _mrs_01_0760: + +Accessing Alluxio Using a Data Application +========================================== + +The port number used for accessing the Alluxio file system is 19998, and the access address is **alluxio://**\ **\ **:19998/**\ **. This section uses examples to describe how to access the Alluxio file system using data applications (Spark, Hive, Hadoop MapReduce, and Presto). + +Using Alluxio as the Input and Output of a Spark Application +------------------------------------------------------------ + +#. Log in to the Master node in a cluster as user **root** using the password set during cluster creation. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Prepare an input file and copy local data to the Alluxio file system. + + For example, prepare the input file **test_input.txt** in the local **/home** directory, and run the following command to save the **test_input.txt** file to Alluxio: + + **alluxio fs copyFromLocal /home/test_input.txt /input** + +#. Run the following commands to start **spark-shell**: + + **spark-shell** + +#. Run the following commands in spark-shell: + + **val s = sc.textFile("alluxio://:19998/input")** + + **val double = s.map(line => line + line)** + + **double.saveAsTextFile("alluxio://:19998/output")** + + .. note:: + + Replace **Name of the Alluxio node>:19998** with the actual node name and port numbers of all nodes where the AlluxioMaster instance is deployed. Use commas (,) to separate the node name and port number, for example, **node-ana-coremspb.mrs-m0va.com:19998,node-master2kiww.mrs-m0va.com:19998,node-master1cqwv.mrs-m0va.com:19998**. + +#. Press **Ctrl+C** to exit spark-shell. + +#. Run the **alluxio fs ls /** command to check whether the output directory **/output** containing double content of the input file exists in the root directory of Alluxio. + +Creating a Hive Table on Alluxio +-------------------------------- + +#. 
Log in to the Master node in a cluster as user **root** using the password set during cluster creation. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Prepare an input file. For example, prepare the **hive_load.txt** input file in the local **/home** directory. The file content is as follows: + + .. code-block:: + + 1, Alice, company A + 2, Bob, company B + +#. Run the following command to import the **hive_load.txt** file to Alluxio: + + **alluxio fs copyFromLocal /home/hive_load.txt /hive_input** + +#. Run the following command to start the Hive beeline: + + **beeline** + +#. Run the following commands in beeline to create a table based on the input file in Alluxio: + + **CREATE TABLE u_user(id INT, name STRING, company STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;** + + **LOAD DATA INPATH 'alluxio://:19998/hive_input' INTO TABLE u_user;** + + .. note:: + + Replace **Name of the Alluxio node>:19998** with the actual node name and port numbers of all nodes where the AlluxioMaster instance is deployed. Use commas (,) to separate the node name and port number, for example, **node-ana-coremspb.mrs-m0va.com:19998,node-master2kiww.mrs-m0va.com:19998,node-master1cqwv.mrs-m0va.com:19998**. + +#. Run the following command to view the created table: + + **select \* from u_user;** + +Running Hadoop Wordcount in Alluxio +----------------------------------- + +#. Log in to the Master node in a cluster as user **root** using the password set during cluster creation. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Prepare an input file and copy local data to the Alluxio file system. + + For example, prepare the input file **test_input.txt** in the local **/home** directory, and run the following command to save the **test_input.txt** file to Alluxio: + + **alluxio fs copyFromLocal /home/test_input.txt /input** + +#. Run the following command to execute the wordcount job: + + **yarn jar /opt/share/hadoop-mapreduce-examples--mrs-/hadoop-mapreduce-examples--mrs-.jar wordcount alluxio://:19998/input alluxio://:19998/output** + + .. note:: + + - Replace **** with the actual one. + - Replace **** with the major version of MRS. For example, for a cluster of MRS 1.9.2, mrs-1.9.0 is used. + - Replace **Name of the Alluxio node>:19998** with the actual node name and port numbers of all nodes where the AlluxioMaster instance is deployed. Use commas (,) to separate the node name and port number, for example, **node-ana-coremspb.mrs-m0va.com:19998,node-master2kiww.mrs-m0va.com:19998,node-master1cqwv.mrs-m0va.com:19998**. + +#. Run the **alluxio fs ls /** command to check whether the output directory **/output** containing the wordcount result exists in the root directory of Alluxio. + +Using Presto to Query Tables in Alluxio +--------------------------------------- + +#. 
Log in to the Master node in a cluster as user **root** using the password set during cluster creation. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** *MRS cluster user* + + Example: **kinit admin** + +#. Run the following commands to start Hive Beeline to create a table on Alluxio. + + **beeline** + + **CREATE TABLE u_user (id int, name string, company string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'alluxio://:19998/u_user';** + + **insert into u_user values(1,'Alice','Company A'),(2, 'Bob', 'Company B');** + + .. note:: + + Replace **Name of the Alluxio node>:19998** with the actual node name and port numbers of all nodes where the AlluxioMaster instance is deployed. Use commas (,) to separate the node name and port number, for example, **node-ana-coremspb.mrs-m0va.com:19998,node-master2kiww.mrs-m0va.com:19998,node-master1cqwv.mrs-m0va.com:19998**. + +#. Start the Presto client. For details, see :ref:`2 ` to :ref:`8 ` in :ref:`Using a Client to Execute Query Statements `. + +#. On the Presto client, run the **select \* from hive.default.u_user;** statement to query the table created in Alluxio: + + + .. figure:: /_static/images/en-us_image_0000001349170061.png + :alt: **Figure 1** Using Presto to query the table created in Alluxio + + **Figure 1** Using Presto to query the table created in Alluxio diff --git a/doc/component-operation-guide/source/using_alluxio/common_operations_of_alluxio.rst b/doc/component-operation-guide/source/using_alluxio/common_operations_of_alluxio.rst new file mode 100644 index 0000000..ef7fba9 --- /dev/null +++ b/doc/component-operation-guide/source/using_alluxio/common_operations_of_alluxio.rst @@ -0,0 +1,166 @@ +:original_name: mrs_01_0757.html + +.. _mrs_01_0757: + +Common Operations of Alluxio +============================ + +Preparations +------------ + +#. Create a cluster with Alluxio installed. + +#. Log in to the active Master node in a cluster as user **root** using the password set during cluster creation. + +#. Run the following command to configure environment variables: + + **source /opt/client/bigdata_env** + +Using the Alluxio Shell +----------------------- + +The `Alluxio shell `__ contains multiple command line operations that interact with Alluxio. + +- View a file system operation command list: + + **alluxio fs** + +- Run the **ls** command to list the files in Alluxio. For example, list all files in the root directory: + + **alluxio fs ls /** + +- Run the **copyFromLocal** command to copy local files to Alluxio: + + **alluxio fs copyFromLocal /home/test_input.txt /test_input.txt** + + Command output: + + .. code-block:: + + Copied file:///home/test_input.txt to /test_input.txt + +- Run the **ls** command again to list the files in Alluxio. The copied **test_input.txt** file is listed: + + **alluxio fs ls /** + + Command output: + + .. code-block:: + + 12 PERSISTED 11-28-2019 17:10:17:449 100% /test_input.txt + + The **test_input.txt** file is displayed in Alluxio. The parameters in the file indicate the file size, whether the file is persistent, creation date, cache ratio of the file in Alluxio, and file name. + +- Run the **cat** command to print file content: + + **alluxio fs cat /test_input.txt** + + Command output: + + .. 
code-block:: + + Test Alluxio + +Mounting Function of Alluxio +---------------------------- + +Alluxio uses a unified namespace feature to unify the access to storage systems. For details, see https://docs.alluxio.io/os/user/2.0/en/advanced/Namespace-Management.html. + +This feature allows users to mount different storage systems to an Alluxio namespace and seamlessly access files across storage systems through the Alluxio namespace. + +#. Create a directory as a mount point in Alluxio. + + **alluxio fs mkdir /mnt** + + .. code-block:: + + Successfully created directory /mnt + +#. Mount an existing OBS file system to Alluxio. (Prerequisite: An agency with the **OBS OperateAccess** permission has been configured for the cluster. The **obs-mrstest** file system is used as an example. Replace the file system name with the actual one. + + **alluxio fs mount /mnt/obs obs://obs-mrstest/data** + + .. code-block:: + + Mounted obs://obs-mrstest/data at /mnt/obs + +#. List files in the OBS file system using the Alluxio namespace. Run the **ls** command to list the files in the OBS mount directory. + + **alluxio fs ls /mnt/obs** + + .. code-block:: + + 38 PERSISTED 11-28-2019 17:42:54:554 0% /mnt/obs/hive_load.txt + 12 PERSISTED 11-28-2019 17:43:07:743 0% /mnt/obs/test_input.txt + + You can also view the newly mounted files and directories on the Alluxio web UI. + +#. After the mounting is complete, you can seamlessly exchange data between different storage systems through the unified namespace of Alluxio. For example, run the **ls -R** command to list all files in a directory recursively: + + **alluxio fs ls -R /** + + .. code-block:: + + 0 PERSISTED 11-28-2019 11:15:19:719 DIR /app-logs + 1 PERSISTED 11-28-2019 11:18:36:885 DIR /apps + 1 PERSISTED 11-28-2019 11:18:40:209 DIR /apps/templeton + 239440292 PERSISTED 11-28-2019 11:18:40:209 0% /apps/templeton/hive.tar.gz + ..... + 1 PERSISTED 11-28-2019 19:00:23:879 DIR /mnt + 2 PERSISTED 11-28-2019 19:00:23:879 DIR /mnt/obs + 38 PERSISTED 11-28-2019 17:42:54:554 0% /mnt/obs/hive_load.txt + 12 PERSISTED 11-28-2019 17:43:07:743 0% /mnt/obs/test_input.txt + ..... + + The command output shows all files that are from the mounted storage system in the root directory of the Alluxio file system (the default directory is the HDFS root directory, that is, **hdfs://hacluster/**). The **/app-logs** and **/apps** directories are in HDFS, and the **/mnt/obs/** directory is in OBS. + +Using Alluxio to Accelerate Data Access +--------------------------------------- + +Alluxio can accelerate data access, because it uses memory to store data. Example commands are provided as follows: + +#. Upload the **test_data.csv** file (a sample that records recipes) to the **/data** directory of the **obs-mrstest** file system. Run the **ls** command to display the file status. + + **alluxio fs ls /mnt/obs/test_data.csv** + + .. code-block:: + + 294520189 PERSISTED 11-28-2019 19:38:55:000 0% /mnt/obs/test_data.csv + + The output indicates that the cache percentage of the file in Alluxio is 0%, that is, the file is not in Alluxio memory. + +#. Count the occurrence times of the word "milk" in the file, and calculate the time consumed. + + **time alluxio fs cat /mnt/obs/test_data.csv \| grep -c milk** + + .. code-block:: + + 52180 + + real 0m10.765s + user 0m5.540s + sys 0m0.696s + +#. Data is stored in memory after being read for the first time. When Alluxio reads data again, the data access speed is increased. 
For example, after running the **cat** command to obtain a file, run the **ls** command to check the file status. + + **alluxio fs ls /mnt/obs/test_data.csv** + + .. code-block:: + + 294520189 PERSISTED 11-28-2019 19:38:55:000 100% /mnt/obs/test_data.csv + + The output shows that the file has been fully loaded to Alluxio. + +#. Access the file again, count the occurrence times of the word "eggs", and calculate the time consumed. + + **time alluxio fs cat /mnt/obs/test_data.csv \| grep -c eggs** + + .. code-block:: + + 59510 + + real 0m5.777s + user 0m5.992s + sys 0m0.592s + + According to the comparison of the two time consumption records, the time consumed for accessing data stored in Alluxio memory is significantly reduced. diff --git a/doc/component-operation-guide/source/using_alluxio/configuring_an_underlying_storage_system.rst b/doc/component-operation-guide/source/using_alluxio/configuring_an_underlying_storage_system.rst new file mode 100644 index 0000000..5a6d7b4 --- /dev/null +++ b/doc/component-operation-guide/source/using_alluxio/configuring_an_underlying_storage_system.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_0759.html + +.. _mrs_01_0759: + +Configuring an Underlying Storage System +======================================== + +If you want to use a unified client API and a global namespace to access persistent storage systems including HDFS and OBS to separate computing from storage, you can configure the underlying storage system of Alluxio on MRS Manager. After a cluster is created, the default underlying storage address is **hdfs://hacluster/**, that is, the HDFS root directory is mapped to Alluxio. + +Prerequisites +------------- + +- Alluxio has been installed in a cluster. +- The password of user **admin** has been obtained. The password of user **admin** is specified by the user during MRS cluster creation. + +Configuring HDFS as the Underlying File System of Alluxio +--------------------------------------------------------- + +.. note:: + + Security clusters with Kerberos authentication enabled do not support this function. + +#. Go to the **All Configurations** page of Alluxio. See :ref:`Modifying Cluster Service Configuration Parameters `. + +#. In the left pane, choose **Alluxio** > **Under Stores**, and modify the value of **alluxio.master.mount.table.root.ufs** to **hdfs://hacluster**\ */XXX/*. + + For example, if you want to use *HDFS root directory*\ **/alluxio/** as the root directory of Alluxio, modify the value of **alluxio.master.mount.table.root.ufs** to **hdfs://hacluster/alluxio/**. + +#. Click **Save Configuration**. In the displayed dialog box, select **Restart the affected services or instances**. + +#. Click **OK** to restart Alluxio. diff --git a/doc/component-operation-guide/source/using_alluxio/index.rst b/doc/component-operation-guide/source/using_alluxio/index.rst new file mode 100644 index 0000000..011b18d --- /dev/null +++ b/doc/component-operation-guide/source/using_alluxio/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0756.html + +.. _mrs_01_0756: + +Using Alluxio +============= + +- :ref:`Configuring an Underlying Storage System ` +- :ref:`Accessing Alluxio Using a Data Application ` +- :ref:`Common Operations of Alluxio ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + configuring_an_underlying_storage_system + accessing_alluxio_using_a_data_application + common_operations_of_alluxio diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_access_control.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_access_control.rst new file mode 100644 index 0000000..95bf9e3 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_access_control.rst @@ -0,0 +1,98 @@ +:original_name: mrs_01_1422.html + +.. _mrs_01_1422: + +CarbonData Access Control +========================= + +The following table provides details about Hive ACL permissions required for performing operations on CarbonData tables. + +Prerequisites +------------- + +Parameters listed in :ref:`Table 5 ` or :ref:`Table 6 ` have been configured. + +Hive ACL permissions +-------------------- + +.. table:: **Table 1** Hive ACL permissions required for CarbonData table-level operations + + +--------------------------------------+---------------------------------------------------------------------------------+ + | Scenario | Required Permission | + +======================================+=================================================================================+ + | DESCRIBE TABLE | SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | SELECT | SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | EXPLAIN | SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | CREATE TABLE | CREATE (of database) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | CREATE TABLE As SELECT | CREATE (on database), INSERT (on table), RW on data file, and SELECT (on table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | LOAD | INSERT (of table) RW on data file | + +--------------------------------------+---------------------------------------------------------------------------------+ + | DROP TABLE | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | DELETE SEGMENTS | DELETE (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | SHOW SEGMENTS | SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | CLEAN FILES | DELETE (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | INSERT OVERWRITE / INSERT INTO | INSERT (of table) RW on data file and SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | CREATE INDEX | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | DROP INDEX | OWNER (of table) | + 
+--------------------------------------+---------------------------------------------------------------------------------+ + | SHOW INDEXES | SELECT (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE ADD COLUMN | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE DROP COLUMN | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE CHANGE DATATYPE | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE RENAME | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE COMPACTION | INSERT (on table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | FINISH STREAMING | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE SET STREAMING PROPERTIES | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE SET TABLE PROPERTIES | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | UPDATE CARBON TABLE | UPDATE (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | DELETE RECORDS | DELETE (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | REFRESH TABLE | OWNER (of main table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | REGISTER INDEX TABLE | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | SHOW PARTITIONS | SELECT (on table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE ADD PARTITION | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + | ALTER TABLE DROP PARTITION | OWNER (of table) | + +--------------------------------------+---------------------------------------------------------------------------------+ + +.. note:: + + - If tables in the database are created by multiple users, the **Drop database** command fails to be executed even if the user who runs the command is the owner of the database. + + - In a secondary index, when the parent table is triggered, **insert** and **compaction** are triggered on the index table. If you select a query that has a filter condition that matches index table columns, you should provide selection permissions for the parent table and index table. + + - The LockFiles folder and lock files created in the LockFiles folder will have full permissions, as the LockFiles folder does not contain any sensitive data. 
+ + - If you are using ACL, ensure you do not configure any path for DDL or DML which is being used by other process. You are advised to create new paths. + + Configure the path for the following configuration items: + + 1) carbon.badRecords.location + + 2) Db_Path and other items during database creation + + - For Carbon ACL in a non-security cluster, **hive.server2.enable.doAs** in the **hive-site.xml** file must be set to **false**. Then the query will run as the user who runs the hiveserver2 process. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_configure_unsafe_memory_in_carbondata.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_configure_unsafe_memory_in_carbondata.rst new file mode 100644 index 0000000..9048bd6 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_configure_unsafe_memory_in_carbondata.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1472.html + +.. _mrs_01_1472: + +How Do I Configure Unsafe Memory in CarbonData? +=============================================== + +Question +-------- + +How do I configure unsafe memory in CarbonData? + +Answer +------ + +In the Spark configuration, the value of **spark.yarn.executor.memoryOverhead** must be greater than the sum of (**sort.inmemory.size.inmb** + **Netty offheapmemory required**), or the sum of (**carbon.unsafe.working.memory.in.mb** + **carbon.sort.inememory.storage.size.in.mb** + **Netty offheapmemory required**). Otherwise, if off-heap access exceeds the configured executor memory, Yarn may stop the executor. + +If **spark.shuffle.io.preferDirectBufs** is set to **true**, the netty transfer service in Spark takes off some heap memory (around 384 MB or 0.1 x executor memory) from **spark.yarn.executor.memoryOverhead**. + +For details, see :ref:`Configuring Executor Off-Heap Memory `. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_logically_split_data_across_different_namespaces.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_logically_split_data_across_different_namespaces.rst new file mode 100644 index 0000000..dad836f --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_do_i_logically_split_data_across_different_namespaces.rst @@ -0,0 +1,67 @@ +:original_name: mrs_01_1469.html + +.. _mrs_01_1469: + +How Do I Logically Split Data Across Different Namespaces? +========================================================== + +Question +-------- + +How do I logically split data across different namespaces? + +Answer +------ + +- Configuration: + + To logically split data across different namespaces, you must update the following configuration in the **core-site.xml** file of HDFS, Hive, and Spark. + + .. note:: + + Changing the Hive component will change the locations of carbonstore and warehouse. + + - Configuration in HDFS + + - **fs.defaultFS**: Name of the default file system. The URI mode must be set to **viewfs**. When **viewfs** is used, the permission part must be **ClusterX**. + - **fs.viewfs.mountable.ClusterX.homedir**: Home directory base path. You can use the getHomeDirectory() method defined in **FileSystem/FileContext** to access the home directory. + - fs.viewfs.mountable.default.link.: ViewFS mount table. + + Example: + + .. 
code-block::
+
+         <property>
+            <name>fs.defaultFS</name>
+            <value>viewfs://ClusterX/</value>
+         </property>
+         <property>
+            <name>fs.viewfs.mounttable.ClusterX.link./folder1</name>
+            <value>hdfs://NS1/folder1</value>
+         </property>
+         <property>
+            <name>fs.viewfs.mounttable.ClusterX.link./folder2</name>
+            <value>hdfs://NS2/folder2</value>
+         </property>
+
+   - Configurations in Hive and Spark
+
+     **fs.defaultFS**: Name of the default file system. The URI mode must be set to **viewfs**. When **viewfs** is used, the permission part must be **ClusterX**.
+
+- Syntax:
+
+  **LOAD DATA INPATH** *'path to data' INTO TABLE table_name OPTIONS ``('...');``*
+
+  .. note::
+
+     When Spark is configured with the viewFS file system and attempts to load data from HDFS, users must specify a path such as **viewfs://** or a relative path as the file path in the **LOAD** statement.
+
+- Example:
+
+  - Sample viewFS path:
+
+    **LOAD DATA INPATH** *'viewfs://ClusterX/dir/data.csv' INTO TABLE table_name OPTIONS ``('...');``*
+
+  - Sample relative path:
+
+    **LOAD DATA INPATH** *'/apps/input_data1.txt'* **INTO TABLE** *table_name*;
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_avoid_minor_compaction_for_historical_data.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_avoid_minor_compaction_for_historical_data.rst
new file mode 100644
index 0000000..69ff941
--- /dev/null
+++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_avoid_minor_compaction_for_historical_data.rst
@@ -0,0 +1,31 @@
+:original_name: mrs_01_1459.html
+
+.. _mrs_01_1459:
+
+How to Avoid Minor Compaction for Historical Data?
+==================================================
+
+Question
+--------
+
+How to avoid minor compaction for historical data?
+
+Answer
+------
+
+If you want to load historical data first and then the incremental data, perform the following steps to avoid minor compaction of the historical data:
+
+#. Load all historical data.
+#. Configure the major compaction size to a value smaller than the segment size of the historical data.
+#. Run the major compaction once on the historical data so that these segments will not be considered later for minor compaction.
+#. Load the incremental data.
+#. Configure the minor compaction threshold as required.
+
+For example:
+
+#. Assume that you have loaded all historical data to CarbonData and the size of each segment is 500 GB.
+#. Set the major compaction threshold property **carbon.major.compaction.size** to **491520** (480 GB x 1024).
+#. Run major compaction. All segments will be compacted because the size of each segment is greater than the configured size.
+#. Perform incremental loading.
+#. Configure the minor compaction threshold to **carbon.compaction.level.threshold** = **6,6**.
+#. Run minor compaction. As a result, only the incremental data is compacted.
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_change_the_default_group_name_for_carbondata_data_loading.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_change_the_default_group_name_for_carbondata_data_loading.rst
new file mode 100644
index 0000000..431f7e8
--- /dev/null
+++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/how_to_change_the_default_group_name_for_carbondata_data_loading.rst
@@ -0,0 +1,19 @@
+:original_name: mrs_01_1460.html
+
+.. _mrs_01_1460:
+
+How to Change the Default Group Name for CarbonData Data Loading?
+================================================================= + +Question +-------- + +How to change the default group name for CarbonData data loading? + +Answer +------ + +By default, the group name for CarbonData data loading is **ficommon**. You can perform the following operation to change the default group name: + +#. Edit the **carbon.properties** file. +#. Change the value of the key **carbon.dataload.group.name** as required. The default value is **ficommon**. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/index.rst new file mode 100644 index 0000000..daec6d8 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/index.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1457.html + +.. _mrs_01_1457: + +CarbonData FAQ +============== + +- :ref:`Why Is Incorrect Output Displayed When I Perform Query with Filter on Decimal Data Type Values? ` +- :ref:`How to Avoid Minor Compaction for Historical Data? ` +- :ref:`How to Change the Default Group Name for CarbonData Data Loading? ` +- :ref:`Why Does INSERT INTO CARBON TABLE Command Fail? ` +- :ref:`Why Is the Data Logged in Bad Records Different from the Original Input Data with Escape Characters? ` +- :ref:`Why Data Load Performance Decreases due to Bad Records? ` +- :ref:`Why INSERT INTO/LOAD DATA Task Distribution Is Incorrect and the Opened Tasks Are Less Than the Available Executors when the Number of Initial Executors Is Zero? ` +- :ref:`Why Does CarbonData Require Additional Executors Even Though the Parallelism Is Greater Than the Number of Blocks to Be Processed? ` +- :ref:`Why Data loading Fails During off heap? ` +- :ref:`Why Do I Fail to Create a Hive Table? ` +- :ref:`Why CarbonData tables created in V100R002C50RC1 not reflecting the privileges provided in Hive Privileges for non-owner? ` +- :ref:`How Do I Logically Split Data Across Different Namespaces? ` +- :ref:`Why Missing Privileges Exception is Reported When I Perform Drop Operation on Databases? ` +- :ref:`Why the UPDATE Command Cannot Be Executed in Spark Shell? ` +- :ref:`How Do I Configure Unsafe Memory in CarbonData? ` +- :ref:`Why Exception Occurs in CarbonData When Disk Space Quota is Set for Storage Directory in HDFS? ` +- :ref:`Why Does Data Query or Loading Fail and "org.apache.carbondata.core.memory.MemoryException: Not enough memory" Is Displayed? ` +- :ref:`Why Do Files of a Carbon Table Exist in the Recycle Bin Even If the drop table Command Is Not Executed When Mis-deletion Prevention Is Enabled? ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + why_is_incorrect_output_displayed_when_i_perform_query_with_filter_on_decimal_data_type_values + how_to_avoid_minor_compaction_for_historical_data + how_to_change_the_default_group_name_for_carbondata_data_loading + why_does_insert_into_carbon_table_command_fail + why_is_the_data_logged_in_bad_records_different_from_the_original_input_data_with_escape_characters + why_data_load_performance_decreases_due_to_bad_records + why_insert_into_load_data_task_distribution_is_incorrect_and_the_opened_tasks_are_less_than_the_available_executors_when_the_number_of_initial_executorsis_zero + why_does_carbondata_require_additional_executors_even_though_the_parallelism_is_greater_than_the_number_of_blocks_to_be_processed + why_data_loading_fails_during_off_heap + why_do_i_fail_to_create_a_hive_table + why_carbondata_tables_created_in_v100r002c50rc1_not_reflecting_the_privileges_provided_in_hive_privileges_for_non-owner + how_do_i_logically_split_data_across_different_namespaces + why_missing_privileges_exception_is_reported_when_i_perform_drop_operation_on_databases + why_the_update_command_cannot_be_executed_in_spark_shell + how_do_i_configure_unsafe_memory_in_carbondata + why_exception_occurs_in_carbondata_when_disk_space_quota_is_set_for_storage_directory_in_hdfs + why_does_data_query_or_loading_fail_and_org.apache.carbondata.core.memory.memoryexception_not_enough_memory_is_displayed + why_do_files_of_a_carbon_table_exist_in_the_recycle_bin_even_if_the_drop_table_command_is_not_executed_when_mis-deletion_prevention_is_enabled diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_carbondata_tables_created_in_v100r002c50rc1_not_reflecting_the_privileges_provided_in_hive_privileges_for_non-owner.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_carbondata_tables_created_in_v100r002c50rc1_not_reflecting_the_privileges_provided_in_hive_privileges_for_non-owner.rst new file mode 100644 index 0000000..516cbc0 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_carbondata_tables_created_in_v100r002c50rc1_not_reflecting_the_privileges_provided_in_hive_privileges_for_non-owner.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1468.html + +.. _mrs_01_1468: + +Why CarbonData tables created in V100R002C50RC1 not reflecting the privileges provided in Hive Privileges for non-owner? +======================================================================================================================== + +Question +-------- + +Why CarbonData tables created in V100R002C50RC1 not reflecting the privileges provided in Hive Privileges for non-owner? + +Answer +------ + +The Hive ACL is implemented after the version V100R002C50RC1, hence the Hive ACL Privileges are not reflecting. + +To support HIVE ACL Privileges for CarbonData tables created in V100R002C50RC1, following two ALTER TABLE commands must be executed by owner of the table. 
+ +**ALTER TABLE** *$dbname.$tablename SET LOCATION '$carbon.store/$dbname/$tablename';* + +**ALTER TABLE** *$dbname.$tablename SET SERDEPROPERTIES ('path'='$carbon.store/$dbname/$tablename');* + +Example: + +Assume database name is 'carbondb', table name is 'carbontable', and CarbonData store location is 'hdfs://hacluster/user/hive/warehouse/carbon.store', then the commands should be executed is as follows: + +**ALTER TABLE** *carbondb.carbontable SET LOCATION 'hdfs://hacluster/user/hive/warehouse/carbon.store/carbondb/carbontable';* + +**ALTER TABLE** *carbondb.carbontable SET SERDEPROPERTIES ('path'='hdfs://hacluster/user/hive/warehouse/carbon.store/carbondb/carbontable');* diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_load_performance_decreases_due_to_bad_records.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_load_performance_decreases_due_to_bad_records.rst new file mode 100644 index 0000000..3f65d95 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_load_performance_decreases_due_to_bad_records.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_1463.html + +.. _mrs_01_1463: + +Why Data Load Performance Decreases due to Bad Records? +======================================================= + +Question +-------- + +Why data load performance decreases due to bad records? + +Answer +------ + +If bad records are present in the data and **BAD_RECORDS_LOGGER_ENABLE** is **true** or **BAD_RECORDS_ACTION** is **redirect** then load performance will decrease due to extra I/O for writing failure reason in log file or redirecting the records to raw CSV. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_loading_fails_during_off_heap.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_loading_fails_during_off_heap.rst new file mode 100644 index 0000000..787b78c --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_data_loading_fails_during_off_heap.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_1466.html + +.. _mrs_01_1466: + +Why Data loading Fails During off heap? +======================================= + +Question +-------- + +Why Data Loading fails during off heap? + +Answer +------ + +YARN Resource Manager will consider (Java heap memory + **spark.yarn.am.memoryOverhead**) as memory limit, so during the off heap, the memory can exceed this limit. So you need to increase the memory by increasing the value of the parameter **spark.yarn.am.memoryOverhead**. 
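+
+For example, the following submission raises the value to 2048 MB. This is only a minimal sketch: the 2048 MB figure, the class name, and the JAR path are placeholders and must be adjusted to the actual job.
+
+.. code-block::
+
+   # Submit the Spark job with a larger ApplicationMaster memory overhead (example value).
+   spark-submit \
+     --master yarn \
+     --conf spark.yarn.am.memoryOverhead=2048 \
+     --class com.example.CarbonDataLoad \
+     /opt/client/Spark/example-data-load.jar
+
+Alternatively, the same property can be set in the client's **spark-defaults.conf** file so that every subsequent submission picks it up.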
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_files_of_a_carbon_table_exist_in_the_recycle_bin_even_if_the_drop_table_command_is_not_executed_when_mis-deletion_prevention_is_enabled.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_files_of_a_carbon_table_exist_in_the_recycle_bin_even_if_the_drop_table_command_is_not_executed_when_mis-deletion_prevention_is_enabled.rst
new file mode 100644
index 0000000..740c536
--- /dev/null
+++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_files_of_a_carbon_table_exist_in_the_recycle_bin_even_if_the_drop_table_command_is_not_executed_when_mis-deletion_prevention_is_enabled.rst
@@ -0,0 +1,16 @@
+:original_name: mrs_01_24537.html
+
+.. _mrs_01_24537:
+
+Why Do Files of a Carbon Table Exist in the Recycle Bin Even If the drop table Command Is Not Executed When Mis-deletion Prevention Is Enabled?
+======================================================================================================================================================
+
+Question
+--------
+
+Why do files of a Carbon table exist in the recycle bin even if the **drop table** command is not executed when mis-deletion prevention is enabled?
+
+Answer
+------
+
+After mis-deletion prevention is enabled for a Carbon table, any file deletion command moves the deleted files to the recycle bin. The intermediate **.carbonindex** files are deleted during the execution of the **insert** or **load** command. Therefore, files of the table may exist in the recycle bin even though the **drop table** command is not executed. If you run the **drop table** command, a table directory with a timestamp is generated, and the files in that directory are complete.
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_i_fail_to_create_a_hive_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_i_fail_to_create_a_hive_table.rst
new file mode 100644
index 0000000..b97cd92
--- /dev/null
+++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_do_i_fail_to_create_a_hive_table.rst
@@ -0,0 +1,26 @@
+:original_name: mrs_01_1467.html
+
+.. _mrs_01_1467:
+
+Why Do I Fail to Create a Hive Table?
+=====================================
+
+Question
+--------
+
+Why do I fail to create a Hive table?
+
+Answer
+------
+
+Creating a Hive table fails when the source table or subquery has a large number of partitions. Such a query requires a large number of tasks and therefore produces a large number of output files, which causes an OOM error in the driver.
+
+This can be avoided by adding a **distribute by** clause on a column with suitable cardinality (number of distinct values) to the table creation statement.
+
+The **distribute by** clause limits the number of Hive table partitions. It uses the cardinality of the given column or **spark.sql.shuffle.partitions**, whichever is smaller. For example, if **spark.sql.shuffle.partitions** is 200 but the cardinality of the column is 100, 200 files are output, of which 100 are empty. Therefore, using a column with very low cardinality (such as 1) causes data skew and affects the distribution of later queries.
+
+You are therefore advised to use a column whose cardinality is greater than **spark.sql.shuffle.partitions**, ideally two to three times greater. A quick way to check this before creating the table is shown in the sketch below.
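+
+The following is a minimal sketch of such a check. It assumes the Spark client is installed in **/opt/client** and reuses **sourcetable1** and **col_age** from the example that follows; substitute your own table and column names.
+
+.. code-block::
+
+   # Load the client environment variables (installation path is an assumption).
+   source /opt/client/bigdata_env
+
+   # Current shuffle parallelism used by Spark SQL.
+   spark-sql -e "SET spark.sql.shuffle.partitions;"
+
+   # Cardinality (number of distinct values) of the candidate column.
+   spark-sql -e "SELECT COUNT(DISTINCT col_age) FROM sourcetable1;"
+
+If the reported cardinality is comfortably larger than the shuffle partition count, create the table with **distribute by** on that column.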
+ +Example: + +**create table hivetable1 as select \* from sourcetable1 distribute by col_age;** diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_carbondata_require_additional_executors_even_though_the_parallelism_is_greater_than_the_number_of_blocks_to_be_processed.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_carbondata_require_additional_executors_even_though_the_parallelism_is_greater_than_the_number_of_blocks_to_be_processed.rst new file mode 100644 index 0000000..21fe074 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_carbondata_require_additional_executors_even_though_the_parallelism_is_greater_than_the_number_of_blocks_to_be_processed.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1465.html + +.. _mrs_01_1465: + +Why Does CarbonData Require Additional Executors Even Though the Parallelism Is Greater Than the Number of Blocks to Be Processed? +================================================================================================================================== + +Question +-------- + +Why does CarbonData require additional executors even though the parallelism is greater than the number of blocks to be processed? + +Answer +------ + +CarbonData block distribution optimizes data processing as follows: + +#. Optimize data processing parallelism. +#. Optimize parallel reading of block data. + +To optimize parallel processing and parallel read, CarbonData requests executors based on the locality of blocks so that it can obtain executors on all nodes. + +If you are using dynamic allocation, you need to configure the following properties: + +#. Set **spark.dynamicAllocation.executorIdleTimeout** to 15 minutes (or the average query time). +#. Set **spark.dynamicAllocation.maxExecutors** correctly. The default value **2048** is not recommended. Otherwise, CarbonData will request the maximum number of executors. +#. For a bigger cluster, set **carbon.dynamicAllocation.schedulerTimeout** to a value ranging from 10 to 15 seconds. The default value is 5 seconds. +#. Set **carbon.scheduler.minRegisteredResourcesRatio** to a value ranging from 0.1 to 1.0. The default value is **0.8**. Block distribution can be started as long as the value of **carbon.scheduler.minRegisteredResourcesRatio** is within the range. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_data_query_or_loading_fail_and_org.apache.carbondata.core.memory.memoryexception_not_enough_memory_is_displayed.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_data_query_or_loading_fail_and_org.apache.carbondata.core.memory.memoryexception_not_enough_memory_is_displayed.rst new file mode 100644 index 0000000..2a9db73 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_data_query_or_loading_fail_and_org.apache.carbondata.core.memory.memoryexception_not_enough_memory_is_displayed.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_1474.html + +.. _mrs_01_1474: + +Why Does Data Query or Loading Fail and "org.apache.carbondata.core.memory.MemoryException: Not enough memory" Is Displayed? 
+============================================================================================================================ + +Question +-------- + +Why does data query or loading fail and "org.apache.carbondata.core.memory.MemoryException: Not enough memory" is displayed? + +Answer +------ + +This exception is thrown when the out-of-heap memory required for data query and loading in the executor is insufficient. + +In this case, increase the values of **carbon.unsafe.working.memory.in.mb** and **spark.yarn.executor.memoryOverhead**. + +For details, see :ref:`How Do I Configure Unsafe Memory in CarbonData? `. + +The memory is shared by data query and loading. Therefore, if the loading and query operations need to be performed at the same time, you are advised to set **carbon.unsafe.working.memory.in.mb** and **spark.yarn.executor.memoryOverhead** to a value greater than 2,048 MB. + +The following formula can be used for estimation: + +Memory required for data loading: + +carbon.number.of.cores.while.loading [default value is 6] x Number of tables to load in parallel x offheap.sort.chunk.size.inmb [default value is 64 MB] + carbon.blockletgroup.size.in.mb [default value is 64 MB] + Current compaction ratio [64 MB/3.5]) + += Around 900 MB per table + +Memory required for data query: + +(SPARK_EXECUTOR_INSTANCES. [default value is 2] x (carbon.blockletgroup.size.in.mb [default value: 64 MB] + carbon.blockletgroup.size.in.mb [default value = 64 MB x 3.5) x Number of cores per executor [default value: 1]) + += ~ 600 MB diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_insert_into_carbon_table_command_fail.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_insert_into_carbon_table_command_fail.rst new file mode 100644 index 0000000..10e8d18 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_does_insert_into_carbon_table_command_fail.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_1461.html + +.. _mrs_01_1461: + +Why Does INSERT INTO CARBON TABLE Command Fail? +=============================================== + +Question +-------- + +Why does the **INSERT INTO CARBON TABLE** command fail and the following error message is displayed? + +.. code-block:: + + Data load failed due to bad record + +Answer +------ + +The **INSERT INTO CARBON TABLE** command fails in the following scenarios: + +- If the data type of source and target table columns are not the same, the data from the source table will be treated as bad records and the **INSERT INTO** command fails. + +- If the result of aggregation function on a source column exceeds the maximum range of the target column, then the **INSERT INTO** command fails. + + Solution: + + You can use the cast function on corresponding columns when inserting records. + + For example: + + #. Run the **DESCRIBE** command to query the target and source table. + + **DESCRIBE** *newcarbontable*; + + Result: + + .. code-block:: + + col1 int + col2 bigint + + **DESCRIBE** *sourcetable*; + + Result: + + .. code-block:: + + col1 int + col2 int + + #. Add the cast function to convert bigint value to integer. 
+ + **INSERT INTO** *newcarbontable select col1, cast(col2 as integer) from sourcetable;* diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_exception_occurs_in_carbondata_when_disk_space_quota_is_set_for_storage_directory_in_hdfs.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_exception_occurs_in_carbondata_when_disk_space_quota_is_set_for_storage_directory_in_hdfs.rst new file mode 100644 index 0000000..721f1d5 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_exception_occurs_in_carbondata_when_disk_space_quota_is_set_for_storage_directory_in_hdfs.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_1473.html + +.. _mrs_01_1473: + +Why Exception Occurs in CarbonData When Disk Space Quota is Set for Storage Directory in HDFS? +============================================================================================== + +Question +-------- + +Why does an exception occur in CarbonData when a disk space quota is set for the storage directory in HDFS? + +Answer +------ + +Data is written to HDFS when you create a table, load data into a table, update a table, and so on. If the configured HDFS directory does not have a sufficient disk space quota, the operation fails and throws the following exception. + +.. code-block:: + + org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: + The DiskSpace quota of /user/tenant is exceeded: + quota = 314572800 B = 300 MB but diskspace consumed = 402653184 B = 384 MB at + org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyStoragespaceQuota(DirectoryWithQuotaFeature.java:211) at + org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:239) at + org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyQuota(FSDirectory.java:941) at + org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:745) + +If such an exception occurs, configure a sufficient disk space quota for the tenant. + +For example: + +If the HDFS replication factor is 3 and the HDFS default block size is 128 MB, at least 384 MB of disk space quota (number of blocks x block size x replication factor of the schema file = 1 x 128 MB x 3 = 384 MB) is required to write a table schema file to HDFS. + +.. note:: + + For fact files, because the default block size is 1024 MB, the minimum space required is 3072 MB per fact file for data loading. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_insert_into_load_data_task_distribution_is_incorrect_and_the_opened_tasks_are_less_than_the_available_executors_when_the_number_of_initial_executorsis_zero.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_insert_into_load_data_task_distribution_is_incorrect_and_the_opened_tasks_are_less_than_the_available_executors_when_the_number_of_initial_executorsis_zero.rst new file mode 100644 index 0000000..055744c --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_insert_into_load_data_task_distribution_is_incorrect_and_the_opened_tasks_are_less_than_the_available_executors_when_the_number_of_initial_executorsis_zero.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1464.html + +..
_mrs_01_1464: + +Why INSERT INTO/LOAD DATA Task Distribution Is Incorrect and the Opened Tasks Are Less Than the Available Executors When the Number of Initial Executors Is Zero? +================================================================================================================================================================= + +Question +-------- + +Why is the **INSERT INTO or LOAD DATA** task distribution incorrect, and why are the opened tasks less than the available executors when the number of initial executors is zero? + +Answer +------ + +For INSERT INTO or LOAD DATA, CarbonData distributes one task per node. If the executors are not allocated from distinct nodes, CarbonData launches fewer tasks. + +**Solution**: + +Configure higher values for the executor memory and cores so that YARN launches only one executor per node. + +#. Configure the number of executor cores. + + - Configure **spark.executor.cores** in **spark-defaults.conf** or **SPARK_EXECUTOR_CORES** in **spark-env.sh** appropriately. + - Add the **--executor-cores NUM** parameter to configure the cores when using the **spark-submit** command. + +#. Configure the executor memory. + + - Configure **spark.executor.memory** in **spark-defaults.conf** or **SPARK_EXECUTOR_MEMORY** in **spark-env.sh** appropriately. + - Add the **--executor-memory MEM** parameter to configure the memory when using the **spark-submit** command. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_incorrect_output_displayed_when_i_perform_query_with_filter_on_decimal_data_type_values.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_incorrect_output_displayed_when_i_perform_query_with_filter_on_decimal_data_type_values.rst new file mode 100644 index 0000000..f6751b8 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_incorrect_output_displayed_when_i_perform_query_with_filter_on_decimal_data_type_values.rst @@ -0,0 +1,45 @@ +:original_name: mrs_01_1458.html + +.. _mrs_01_1458: + +Why Is Incorrect Output Displayed When I Perform Query with Filter on Decimal Data Type Values? +=============================================================================================== + +Question +-------- + +Why is incorrect output displayed when I perform a query with a filter on decimal data type values? + +For example: + +**select \* from carbon_table where num = 1234567890123456.22;** + +Output: + +.. code-block:: + + +------+---------------------+--+ + | name | num                 | + +------+---------------------+--+ + | IAA  | 1234567890123456.22 | + | IAA  | 1234567890123456.21 | + +------+---------------------+--+ + +Answer +------ + +To obtain accurate output, append BD to the number. + +For example: + +**select \* from carbon_table where num = 1234567890123456.22BD;** + +Output: + +..
code-block:: + + +------+---------------------+--+ + | name | num | + +------+---------------------+--+ + | IAA | 1234567890123456.22 | + +------+---------------------+--+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_the_data_logged_in_bad_records_different_from_the_original_input_data_with_escape_characters.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_the_data_logged_in_bad_records_different_from_the_original_input_data_with_escape_characters.rst new file mode 100644 index 0000000..0373e85 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_is_the_data_logged_in_bad_records_different_from_the_original_input_data_with_escape_characters.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1462.html + +.. _mrs_01_1462: + +Why Is the Data Logged in Bad Records Different from the Original Input Data with Escape Characters? +==================================================================================================== + +Question +-------- + +Why is the data logged in bad records different from the original input data with escaped characters? + +Answer +------ + +An escape character is a backslash (\\) followed by one or more characters. If the input records contain escape characters such as \\t, \\b, \\n, \\r, \\f, \\', \\", \\\\ , java will process the escape character '\\' and the following characters together to obtain the escaped meaning. + +For example, if the CSV data type **2010\\\\10,test** is inserted to String,int type, the value is treated as bad records, because **tes**\ t cannot be converted to int. The value logged in the bad records is **2010\\10** because java processes **\\\\** as **\\**. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_missing_privileges_exception_is_reported_when_i_perform_drop_operation_on_databases.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_missing_privileges_exception_is_reported_when_i_perform_drop_operation_on_databases.rst new file mode 100644 index 0000000..a366220 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_missing_privileges_exception_is_reported_when_i_perform_drop_operation_on_databases.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1470.html + +.. _mrs_01_1470: + +Why Missing Privileges Exception is Reported When I Perform Drop Operation on Databases? +======================================================================================== + +Question +-------- + +Why drop database cascade is throwing the following exception? + +.. code-block:: + + Error: org.apache.spark.sql.AnalysisException: Missing Privileges;(State=,code=0) + +Answer +------ + +This error is thrown when the owner of the database performs **drop database cascade** which contains tables created by other users. 
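+
+For example, the following spark-beeline sequence (a hypothetical sketch; the user names **user_a** and **user_b**, the database **salesdb**, and the table **orders** are examples only) can trigger this error:
+
+.. code-block::
+
+   -- Executed by user_a, who owns the database:
+   CREATE DATABASE salesdb;
+
+   -- Executed by user_b:
+   CREATE TABLE salesdb.orders (id int, amount double) STORED AS carbondata;
+
+   -- Executed by user_a: fails with "Missing Privileges" because salesdb.orders was created by user_b
+   DROP DATABASE salesdb CASCADE;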
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_the_update_command_cannot_be_executed_in_spark_shell.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_the_update_command_cannot_be_executed_in_spark_shell.rst new file mode 100644 index 0000000..3b03b17 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_faq/why_the_update_command_cannot_be_executed_in_spark_shell.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_1471.html + +.. _mrs_01_1471: + +Why the UPDATE Command Cannot Be Executed in Spark Shell? +========================================================= + +Question +-------- + +Why the UPDATE command cannot be executed in Spark Shell? + +Answer +------ + +The syntax and examples provided in this document are about Beeline commands instead of Spark Shell commands. + +To run the UPDATE command in Spark Shell, use the following syntax: + +- Syntax 1 + + **.sql("UPDATE SET (column_name1, column_name2, ... column_name n) = (column1_expression , column2_expression , column3_expression ... column n_expression) [ WHERE { } ];").show** + +- Syntax 2 + + **.sql("UPDATE SET (column_name1, column_name2,) = (select sourceColumn1, sourceColumn2 from sourceTable [ WHERE { } ] ) [ WHERE { } ];").show** + +Example: + +If the context of CarbonData is **carbon**, run the following command: + +**carbon.sql("update carbonTable1 d set (d.column3,d.column5) = (select s.c33 ,s.c55 from sourceTable1 s where d.column1 = s.c11) where d.column1 = 'country' exists( select \* from table3 o where o.c2 > 1);").show** diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_data_migration.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_data_migration.rst new file mode 100644 index 0000000..bfc4b68 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_data_migration.rst @@ -0,0 +1,104 @@ +:original_name: mrs_01_1416.html + +.. _mrs_01_1416: + +CarbonData Data Migration +========================= + +Scenario +-------- + +If you want to rapidly migrate CarbonData data from a cluster to another one, you can use the CarbonData backup and restoration commands. This method does not require data import in the target cluster, reducing required migration time. + +Prerequisites +------------- + +The Spark2x client has been installed in a directory, for example, **/opt/client**, in two clusters. The source cluster is cluster A, and the target cluster is cluster B. + +Procedure +--------- + +#. Log in to the node where the client is installed in cluster A as a client installation user. + +#. Run the following commands to configure environment variables: + + **source /opt/client/bigdata_env** + + **source /opt/client/Spark2x/component_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, skip user authentication. + + **kinit** *carbondatauser* + + *carbondatauser* indicates the user of the original data. That is, the user has the read and write permissions for the tables. + + .. note:: + + You must add the user to the **hadoop** (primary group) and **hive** groups, and associate it with the **System_administrator** role. + +#. 
Run the following command to connect to the database and check the location for storing table data on HDFS: + + **spark-beeline** + + **desc formatted** *Name of the table containing the original data*\ **;** + + **Location** in the displayed information indicates the directory where the data file resides. + +#. Log in to the node where the client is installed in cluster B as a client installation user and configure the environment variables: + + **source /opt/client/bigdata_env** + + **source /opt/client/Spark2x/component_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, skip user authentication. + + **kinit** *carbondatauser2* + + *carbondatauser2* indicates the user that uploads data. + + .. note:: + + You must add the user to the **hadoop** (primary group) and **hive** groups, and associate it with the **System_administrator** role. + +#. Run the **spark-beeline** command to connect to the database. + +#. Does the database that maps to the original data exist? + + - If yes, go to :ref:`9 `. + - If no, run the **create database** *Database name* command to create a database with the same name as that maps to the original data and go to :ref:`9 `. + +#. .. _mrs_01_1416__lb95e9d29c6fc469a8375f190f4136467: + + Copy the original data from the HDFS directory in cluster A to that in cluster B. + + When uploading data in cluster B, ensure that the upload directory has the directories with the same names as the database and table in the original directory and the upload user has the permission to write data to the upload directory. After the data is uploaded, the user has the permission to read and write the data. + + For example, if the original data is stored in **/user/carboncadauser/warehouse/db1/tb1**, the data can be stored in **/user/carbondatauser2/warehouse/db1/tb1** in the new cluster. + + a. Run the following command to download the original data to the **/opt/backup** directory of cluster A: + + **hdfs dfs -get** **/user/carboncadauser/warehouse/db1/tb1** **/opt/backup** + + b. Run the following command to copy the original data of cluster A to the **/opt/backup** directory on the client node of cluster B. + + **scp /opt/backup root@**\ *IP address of the client node of cluster B*:**/opt/backup** + + c. Run the following command to upload the data copied to cluster B to HDFS: + + **hdfs dfs -put** **/opt/backup** **/user/carbondatauser2/warehouse/db1/tb1** + +#. .. _mrs_01_1416__laf7ce95fc3cc4ab2a96640541690ed30: + + In the client environment of cluster B, run the following command to generate the metadata associated with the table corresponding to the original data in Hive: + + **REFRESH TABLE** *$dbName.$tbName*\ **;** + + *$dbName* indicates the database name, and *$tbName* indicates the table name. + +#. If the original table contains an index table, perform :ref:`9 ` and :ref:`10 ` to migrate the index table directory from cluster A to cluster B. + +#. Run the following command to register an index table for the CarbonData table (skip this step if no index table is created for the original table): + + **REGISTER INDEX TABLE** *$tableName* ON *$maintable*; + + *$tableName* indicates the index table name, and *$maintable* indicates the table name. 
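+
+The following spark-beeline sketch summarizes the metadata operations in cluster B after the HDFS copy is complete. It assumes the database **db1** and table **tb1** used in the example above; the index table name **idx_tb1** is hypothetical and is only relevant if the original table has an index table:
+
+.. code-block::
+
+   -- Regenerate the metadata for the migrated table in Hive.
+   REFRESH TABLE db1.tb1;
+
+   -- Register the migrated index table (skip this if the original table has no index table).
+   REGISTER INDEX TABLE idx_tb1 ON db1.tb1;
+
+   -- Verify that the migrated data can be queried.
+   SELECT count(*) FROM db1.tb1;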
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_quick_start.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_quick_start.rst new file mode 100644 index 0000000..8a584c2 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_quick_start.rst @@ -0,0 +1,177 @@ +:original_name: mrs_01_1406.html + +.. _mrs_01_1406: + +CarbonData Quick Start +====================== + +This section describes how to create CarbonData tables, load data, and query data. This quick start provides operations based on the Spark Beeline client. If you want to use Spark shell, wrap the queries with **spark.sql()**. + +**The following describes how to load data from a CSV file to a CarbonData table**. + +.. table:: **Table 1** CarbonData Quick Start + + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + | Operation | Description | + +===============================================================================================+=======================================================================+ + | :ref:`Preparing a CSV File ` | Prepare the CSV file to be loaded to the CarbonData Table. | + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + | :ref:`Connecting to CarbonData ` | Connect to CarbonData before performing any operations on CarbonData. | + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + | :ref:`Creating a CarbonData Table ` | Create a CarbonData table to load data and perform query operations. | + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + | :ref:`Loading Data to a CarbonData Table ` | Load data from CSV to the created table. | + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + | :ref:`Querying Data from a CarbonData Table ` | Perform query operations such as filters and groupby. | + +-----------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+ + +.. _mrs_01_1406__section497653817420: + +Preparing a CSV File +-------------------- + +#. Prepare a CSV file named **test.csv** on the local PC. An example is as follows: + + .. 
code-block:: + + 13418592122,1001, MAC address, 2017-10-23 15:32:30,2017-10-24 15:32:30,62.50,74.56 + 13418592123 1002, MAC address, 2017-10-23 16:32:30,2017-10-24 16:32:30,17.80,76.28 + 13418592124,1003, MAC address, 2017-10-23 17:32:30,2017-10-24 17:32:30,20.40,92.94 + 13418592125 1004, MAC address, 2017-10-23 18:32:30,2017-10-24 18:32:30,73.84,8.58 + 13418592126,1005, MAC address, 2017-10-23 19:32:30,2017-10-24 19:32:30,80.50,88.02 + 13418592127 1006, MAC address, 2017-10-23 20:32:30,2017-10-24 20:32:30,65.77,71.24 + 13418592128,1007, MAC address, 2017-10-23 21:32:30,2017-10-24 21:32:30,75.21,76.04 + 13418592129,1008, MAC address, 2017-10-23 22:32:30,2017-10-24 22:32:30,63.30,94.40 + 13418592130, 1009, MAC address, 2017-10-23 23:32:30,2017-10-24 23:32:30,95.51,50.17 + 13418592131,1010, MAC address, 2017-10-24 00:32:30,2017-10-25 00:32:30,39.62,99.13 + +#. Use WinSCP to import the CSV file to the directory of the node where the client is installed, for example, **/opt**. + +#. Log in to FusionInsight Manager and choose **System**. In the navigation pane on the left, choose **Permission** > **User**, click **Create** to create human-machine user **sparkuser**, and add the user to user groups hadoop (primary group) and hive. + +#. Run the following commands to go to the client installation directory, load environment variables, and authenticate the user. + + **cd /**\ *Client installation directory* + + **source ./bigdata_env** + + **source ./Spark2x/component_env** + + **kinit sparkuser** + +#. .. _mrs_01_1406__li122143593123: + + Run the following command to upload the CSV file to the **/data** directory of the HDFS. + + **hdfs dfs -put /opt/test.csv /data/** + +.. _mrs_01_1406__s2af9b9318a0f44c48f3b0fa8217a12fe: + +Connecting to CarbonData +------------------------ + +- Use Spark SQL or Spark shell to connect to Spark and run Spark SQL commands. + +- Run the following commands to start the JDBCServer and use a JDBC client (for example, Spark Beeline) to connect to the JDBCServer. + + **cd ./Spark2x/spark/bin** + + **./spark-beeline** + +.. _mrs_01_1406__sffd808b54dab44fc8613a01cc8e39baf: + +Creating a CarbonData Table +--------------------------- + +After connecting Spark Beeline with the JDBCServer, create a CarbonData table to load data and perform query operations. Run the following commands to create a simple table: + +**create table x1 (imei string, deviceInformationId int, mac string, productdate timestamp, updatetime timestamp, gamePointId double, contractNumber double) STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='imei,mac');** + +The command output is as follows: + +.. code-block:: + + +---------+ + | Result | + +---------+ + +---------+ + No rows selected (1.093 seconds) + +.. _mrs_01_1406__s79b594787a2a46819cf07478d4a0d81c: + +Loading Data to a CarbonData Table +---------------------------------- + +After you have created a CarbonData table, you can load the data from CSV to the created table. + +Run the following command with required parameters to load data from CSV. The column names of the CarbonData table must match the column names of the CSV file. + +**LOAD DATA inpath 'hdfs://hacluster/data/**\ *test.csv*\ **' into table** *x1* **options('DELIMITER'=',', 'QUOTECHAR'='"','FILEHEADER'='imei, deviceinformationid,mac, productdate,updatetime, gamepointid,contractnumber');** + +**test.csv** is the CSV file prepared in :ref:`Preparing a CSV File ` and **x1** is the table name. + +The CSV example file is as follows: + +.. 
code-block:: + + 13418592122,1001, MAC address, 2017-10-23 15:32:30,2017-10-24 15:32:30,62.50,74.56 + 13418592123 1002, MAC address, 2017-10-23 16:32:30,2017-10-24 16:32:30,17.80,76.28 + 13418592124,1003, MAC address, 2017-10-23 17:32:30,2017-10-24 17:32:30,20.40,92.94 + 13418592125 1004, MAC address, 2017-10-23 18:32:30,2017-10-24 18:32:30,73.84,8.58 + 13418592126,1005, MAC address, 2017-10-23 19:32:30,2017-10-24 19:32:30,80.50,88.02 + 13418592127 1006, MAC address, 2017-10-23 20:32:30,2017-10-24 20:32:30,65.77,71.24 + 13418592128,1007, MAC address, 2017-10-23 21:32:30,2017-10-24 21:32:30,75.21,76.04 + 13418592129,1008, MAC address, 2017-10-23 22:32:30,2017-10-24 22:32:30,63.30,94.40 + 13418592130, 1009, MAC address, 2017-10-23 23:32:30,2017-10-24 23:32:30,95.51,50.17 + 13418592131,1010, MAC address, 2017-10-24 00:32:30,2017-10-25 00:32:30,39.62,99.13 + +The command output is as follows: + +.. code-block:: + + +------------+ + |Segment ID | + +------------+ + |0 | + +------------+ + No rows selected (3.039 seconds) + +.. _mrs_01_1406__s68e9413d1b234b2d91557a1739fc7828: + +Querying Data from a CarbonData Table +------------------------------------- + +After a CarbonData table is created and the data is loaded, you can perform query operations as required. Some query operations are provided as examples. + +- **Obtaining the number of records** + + Run the following command to obtain the number of records in the CarbonData table: + + **select count(*) from x1;** + +- **Querying with the groupby condition** + + Run the following command to obtain the **deviceinformationid** records without repetition in the CarbonData table: + + **select deviceinformationid,count (distinct deviceinformationid) from x1 group by deviceinformationid;** + +- **Querying with Filter** + + Run the following command to obtain specific **deviceinformationid** records: + + **select \* from x1 where deviceinformationid='1010';** + +.. note:: + + If the query result has non-English characters, the columns in the query result may not be aligned. This is because characters of different languages occupy different widths. + +Using CarbonData on Spark-shell +------------------------------- + +If you need to use CarbonData on a Spark-shell, you need to create a CarbonData table, load data to the CarbonData table, and query data in CarbonData as follows: + +.. code-block:: + + spark.sql("CREATE TABLE x2(imei string, deviceInformationId int, mac string, productdate timestamp, updatetime timestamp, gamePointId double, contractNumber double) STORED AS carbondata") + spark.sql("LOAD DATA inpath 'hdfs://hacluster/data/x1_without_header.csv' into table x2 options('DELIMITER'=',', 'QUOTECHAR'='\"','FILEHEADER'='imei, deviceinformationid,mac, productdate,updatetime, gamepointid,contractnumber')") + spark.sql("SELECT * FROM x2").show() diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/combining_segments.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/combining_segments.rst new file mode 100644 index 0000000..1ad54b2 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/combining_segments.rst @@ -0,0 +1,89 @@ +:original_name: mrs_01_1415.html + +.. 
_mrs_01_1415: + +Combining Segments +================== + +Scenario +-------- + +Frequent data access results in a large number of fragmented CarbonData files in the storage directory. In each data loading, data is sorted and indexing is performed. This means that an index is generated for each load. With the increase of data loading times, the number of indexes also increases. As each index works only on one loading, the performance of index is reduced. CarbonData provides loading and compression functions. In a compression process, data in each segment is combined and sorted, and multiple segments are combined into one large segment. + +Prerequisites +------------- + +Multiple data loadings have been performed. + +Operation Description +--------------------- + +There are three types of compaction: Minor, Major, and Custom. + +- Minor compaction: + + In minor compaction, you can specify the number of loads to be merged. If **carbon.enable.auto.load.merge** is set, minor compaction is triggered for every data load. If any segments are available to be merged, then compaction will run parallel with data load. + + There are two levels in minor compaction: + + - Level 1: Merging of the segments which are not yet compacted + - Level 2: Merging of the compacted segments again to form a larger segment + +- Major compaction: + + Multiple segments can be merged into one large segment. You can specify the compaction size so that all segments below the size will be merged. Major compaction is usually done during the off-peak time. + +- .. _mrs_01_1415__li68503712544: + + Custom compaction: + + In Custom compaction, you can specify the IDs of multiple segments to merge them into a large segment. The IDs of all the specified segments must exist and be valid. Otherwise, the compaction fails. Custom compaction is usually done during the off-peak time. + +For details, see :ref:`ALTER TABLE COMPACTION `. + +.. _mrs_01_1415__t9ba7557f991f4d6caad3710c4a51b9f2: + +.. table:: **Table 1** Compaction parameters + + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Application Type | Description | + +=========================================+=================+==================+====================================================================================================================================================================================================================================================================================================================================================+ + | carbon.enable.auto.load.merge | false | Minor | Whether to enable compaction along with data loading. | + | | | | | + | | | | **true**: Compaction is automatically triggered when data is loaded. | + | | | | | + | | | | **false**: Compaction is not triggered when data is loaded. 
| + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.compaction.level.threshold | 4,3 | Minor | This configuration is for minor compaction which decides how many segments to be merged. | + | | | | | + | | | | For example, if this parameter is set to **2,3**, minor compaction is triggered every two segments and segments form a single level 1 compacted segment. When the number of compacted level 1 segments reach 3, compaction is triggered again to merge them to form a single level 2 segment. | + | | | | | + | | | | The compaction policy depends on the actual data size and available resources. | + | | | | | + | | | | The value ranges from 0 to 100. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.major.compaction.size | 1024 MB | Major | The major compaction size can be configured using this parameter. Sum of the segments which is below this threshold will be merged. | + | | | | | + | | | | For example, if this parameter is set to 1024 MB, and there are five segments whose sizes are 300 MB, 400 MB, 500 MB, 200 MB, and 100 MB used for major compaction, only segments whose total size is less than this threshold are compacted. In this example, only the segments whose sizes are 300 MB, 400 MB, 200 MB, and 100 MB are compacted. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.numberof.preserve.segments | 0 | Minor/Major | If you want to preserve some number of segments from being compacted, then you can set this configuration. | + | | | | | + | | | | For example, if **carbon.numberof.preserve.segments** is set to **2**, the latest two segments will always be excluded from the compaction. | + | | | | | + | | | | By default, no segments are reserved. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.allowed.compaction.days | 0 | Minor/Major | This configuration is used to control on the number of recent segments that needs to be compacted. 
| + | | | | | + | | | | For example, if this parameter is set to **2**, the segments which are loaded in the time frame of past 2 days only will get merged. Segments which are loaded earlier than 2 days will not be merged. | + | | | | | + | | | | This configuration is disabled by default. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.number.of.cores.while.compacting | 2 | Minor/Major | Number of cores to be used while compacting data. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.merge.index.in.segment | true | SEGMENT_INDEX | If this parameter is set to **true**, all the Carbon index (.carbonindex) files in a segment will be merged into a single Index (.carbonindexmerge) file. This enhances the first query performance. | + +-----------------------------------------+-----------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Reference +--------- + +You are advised not to perform minor compaction on historical data. For details, see :ref:`How to Avoid Minor Compaction for Historical Data? `. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/deleting_segments.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/deleting_segments.rst new file mode 100644 index 0000000..1b7fc8f --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/deleting_segments.rst @@ -0,0 +1,101 @@ +:original_name: mrs_01_1414.html + +.. _mrs_01_1414: + +Deleting Segments +================= + +Scenario +-------- + +If you want to modify and reload the data because you have loaded wrong data into a table, or there are too many bad records, you can delete specific segments by segment ID or data loading time. + +.. note:: + + The segment deletion operation only deletes segments that are not compacted. You can run the **CLEAN FILES** command to clear compacted segments. + +Deleting a Segment by Segment ID +-------------------------------- + +Each segment has a unique ID. This segment ID can be used to delete the segment. + +#. Obtain the segment ID. 
+ + Command: + + **SHOW SEGMENTS FOR Table** *dbname.tablename LIMIT number_of_loads;* + + Example: + + **SHOW SEGMENTS FOR TABLE** *carbonTable;* + + Run the preceding command to show all the segments of the table named **carbonTable**. + + **SHOW SEGMENTS FOR TABLE** *carbonTable LIMIT 2;* + + Run the preceding command to show segments specified by *number_of_loads*. + + The command output is as follows: + + .. code-block:: + + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | ID | Status | Load Start Time | Load Time Taken | Partition | Data Size | Index Size | File Format | + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | 3 | Success | 2020-09-28 22:53:26.336 | 3.726S | {} | 6.47KB | 3.30KB | columnar_v3 | + | 2 | Success | 2020-09-28 22:53:01.702 | 6.688S | {} | 6.47KB | 3.30KB | columnar_v3 | + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + + .. note:: + + The output of the **SHOW SEGMENTS** command includes ID, Status, Load Start Time, Load Time Taken, Partition, Data Size, Index Size, and File Format. The latest loading information is displayed in the first line of the command output. + +#. Run the following command to delete the segment after you have found the Segment ID: + + Command: + + **DELETE FROM TABLE tableName WHERE SEGMENT.ID IN (load_sequence_id1, load_sequence_id2, ....)**; + + Example: + + **DELETE FROM TABLE carbonTable WHERE SEGMENT.ID IN (1,2,3)**; + + For details, see :ref:`DELETE SEGMENT by ID `. + +Deleting a Segment by Data Loading Time +--------------------------------------- + +You can delete a segment based on the loading time. + +Command: + +**DELETE FROM TABLE db_name.table_name WHERE SEGMENT.STARTTIME BEFORE date_value**; + +Example: + +**DELETE FROM TABLE carbonTable WHERE SEGMENT.STARTTIME BEFORE '2017-07-01 12:07:20'**; + +The preceding command can be used to delete all segments before 2017-07-01 12:07:20. + +For details, see :ref:`DELETE SEGMENT by DATE `. + +Result +------ + +Data of corresponding segments is deleted and is unavailable for query. You can run the **SHOW SEGMENTS** command to display the segment status and check whether the segment has been deleted. + +.. note:: + + - Segments are not physically deleted after the execution of the **DELETE SEGMENT** command. Therefore, if you run the **SHOW SEGMENTS** command to check the status of a deleted segment, it will be marked as **Marked for Delete**. If you run the **SELECT \* FROM tablename** command, the deleted segment will be excluded. + + - The deleted segment will be deleted physically only when the next data loading reaches the maximum query execution duration, which is configured by the **max.query.execution.time** parameter. The default value of the parameter is 60 minutes. + + - If you want to forcibly delete a physical segment file, run the **CLEAN FILES** command. + + Example: + + **CLEAN FILES FOR TABLE table1;** + + This command will physically delete the segment file in the **Marked for delete** state. + + If this command is executed before the time specified by **max.query.execution.time** arrives, the query may fail. **max.query.execution.time** indicates the maximum time allowed for a query, which is set in the **carbon.properties** file. 
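+
+The following spark-beeline sequence is a minimal sketch of the workflow described above, using the **carbonTable** example table and hypothetical segment IDs:
+
+.. code-block::
+
+   -- List the segments and note the IDs to delete.
+   SHOW SEGMENTS FOR TABLE carbonTable;
+
+   -- Mark segments 1 and 2 for deletion; they are excluded from subsequent queries.
+   DELETE FROM TABLE carbonTable WHERE SEGMENT.ID IN (1,2);
+
+   -- The deleted segments are now shown as "Marked for Delete".
+   SHOW SEGMENTS FOR TABLE carbonTable;
+
+   -- Physically remove the segment files that are marked for deletion.
+   CLEAN FILES FOR TABLE carbonTable;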
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/index.rst new file mode 100644 index 0000000..e5d9749 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1412.html + +.. _mrs_01_1412: + +CarbonData Table Data Management +================================ + +- :ref:`Loading Data ` +- :ref:`Deleting Segments ` +- :ref:`Combining Segments ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + loading_data + deleting_segments + combining_segments diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/loading_data.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/loading_data.rst new file mode 100644 index 0000000..c5c5070 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_data_management/loading_data.rst @@ -0,0 +1,11 @@ +:original_name: mrs_01_1413.html + +.. _mrs_01_1413: + +Loading Data +============ + +Scenario +-------- + +After a CarbonData table is created, you can run the **LOAD DATA** command to load data to the table for query. Once data loading is triggered, data is encoded in CarbonData format and files in multi-dimensional and column-based format are compressed and copied to the HDFS path of CarbonData files for quick analysis and queries. The HDFS path can be configured in the **carbon.properties** file. For details, see :ref:`Configuration Reference `. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/about_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/about_carbondata_table.rst new file mode 100644 index 0000000..82b05e9 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/about_carbondata_table.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_1408.html + +.. _mrs_01_1408: + +About CarbonData Table +====================== + +Overview +-------- + +In CarbonData, data is stored in entities called tables. CarbonData tables are similar to RDBMS tables. RDBMS data is stored in a table consisting of rows and columns. CarbonData tables store structured data, and have fixed columns and data types. + +Supported Data Types +-------------------- + +CarbonData tables support the following data types: + +- Int +- String +- BigInt +- Smallint +- Char +- Varchar +- Boolean +- Decimal +- Double +- TimeStamp +- Date +- Array +- Struct +- Map + +The following table describes supported data types and their respective values range. + +.. 
table:: **Table 1** CarbonData data types + + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Type | Value Range | + +======================================================+======================================================================================================================================================================+ + | Int | 4-byte signed integer ranging from -2,147,483,648 to 2,147,483,647. | + | | | + | | .. note:: | + | | | + | | If a non-dictionary column is of the **int** data type, it is internally stored as the **BigInt** type. | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | String | 100,000 characters | + | | | + | | .. note:: | + | | | + | | If the **CHAR** or **VARCHAR** data type is used in **CREATE TABLE**, the two data types are automatically converted to the String data type. | + | | | + | | If a column contains more than 32,000 characters, add the column to the **LONG_STRING_COLUMNS** attribute of the **tblproperties** table during table creation. | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | BigInt | 64-bit value ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SmallInt | -32,768 to 32,767 | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Char | A to Z and a to z | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Varchar | A to Z, a to z, and 0 to 9 | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Boolean | **true** or **false** | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Decimal | The default value is (10,0) and maximum value is (38,38). | + | | | + | | .. note:: | + | | | + | | When query with filters, append **BD** to the number to achieve accurate results. For example, **select \* from carbon_table where num = 1234567890123456.22BD**. 
| + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Double | 64-bit value ranging from 4.9E-324 to 1.7976931348623157E308 | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | TimeStamp | The default format is **yyyy-MM-dd HH:mm:ss**. | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Date | The **DATE** data type is used to store calendar dates. The default format is **yyyy-MM-DD**. | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Array | N/A | + | | | + | | .. note:: | + | | | + | | Currently, only two layers of complex types can be nested. | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Struct | | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Map | | + +------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/creating_a_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/creating_a_carbondata_table.rst new file mode 100644 index 0000000..b4c125a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/creating_a_carbondata_table.rst @@ -0,0 +1,92 @@ +:original_name: mrs_01_1409.html + +.. _mrs_01_1409: + +Creating a CarbonData Table +=========================== + +Scenario +-------- + +A CarbonData table must be created to load and query data. You can run the **Create Table** command to create a table. This command is used to create a table using custom columns. + +Creating a Table with Self-Defined Columns +------------------------------------------ + +Users can create a table by specifying its columns and data types. + +Sample command: + +**CREATE TABLE** *IF NOT EXISTS productdb.productSalesTable (* + +*productNumber Int,* + +*productName String,* + +*storeCity String,* + +*storeProvince String,* + +*productCategory String,* + +*productBatch String,* + +*saleQuantity Int,* + +*revenue Int)* + +STORED AS *carbondata* + +*TBLPROPERTIES (* + +*'table_blocksize'='128');* + +The following table describes parameters of preceding commands. + +.. 
table:: **Table 1** Parameter description + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================================================================================================================================================+ + | productSalesTable | Table name. The table is used to load data for analysis. | + | | | + | | The table name consists of letters, digits, and underscores (_). | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | productdb | Database name. The database maintains logical connections with tables stored in it to identify and manage the tables. | + | | | + | | The database name consists of letters, digits, and underscores (_). | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | productName | Columns in the table. The columns are service entities for data analysis. | + | | | + | storeCity | The column name (field name) consists of letters, digits, and underscores (_). | + | | | + | storeProvince | | + | | | + | procuctCategory | | + | | | + | productBatch | | + | | | + | saleQuantity | | + | | | + | revenue | | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table_blocksize | Indicates the block size of data files used by the CarbonData table, in MB. The value ranges from **1** to **2048**. The default value is **1024**. | + | | | + | | If **table_blocksize** is too small, a large number of small files will be generated when data is loaded. This may affect the performance of HDFS. | + | | | + | | If **table_blocksize** is too large, during data query, the amount of block data that matches the index is large, and some blocks contain a large number of blocklets, affecting read concurrency and lowering query performance. | + | | | + | | You are advised to set the block size based on the data volume. For example, set the block size to 256 MB for GB-level data, 512 MB for TB-level data, and 1024 MB for PB-level data. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + - Measurement of all Integer data is processed and displayed using the **BigInt** data type. + - CarbonData parses data strictly. Any data that cannot be parsed is saved as **null** in the table. For example, if the user loads the **double** value (3.14) to the BigInt column, the data is saved as **null**. 
+ - The Short and Long data types used in the **Create Table** command are shown as Smallint and BigInt in the **DESCRIBE** command, respectively. + - You can run the **DESCRIBE** command to view the table data size and table index size. + +Operation Result +---------------- + +Run the command to create a table. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/deleting_a_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/deleting_a_carbondata_table.rst new file mode 100644 index 0000000..49dbd7c --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/deleting_a_carbondata_table.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_1410.html + +.. _mrs_01_1410: + +Deleting a CarbonData Table +=========================== + +Scenario +-------- + +You can run the **DROP TABLE** command to delete a table. After a CarbonData table is deleted, its metadata and loaded data are deleted together. + +Procedure +--------- + +Run the following command to delete a CarbonData table: + +Run the following command: + +**DROP TABLE** *[IF EXISTS] [db_name.]table_name;* + +Once this command is executed, the table is deleted from the system. In the command, **db_name** is an optional parameter. If **db_name** is not specified, the table named **table_name** in the current database is deleted. + +Example: + +**DROP TABLE** *productdb.productSalesTable;* + +Run the preceding command to delete the **productSalesTable** table from the **productdb** database. + +Operation Result +---------------- + +Deletes the table specified in the command from the system. After the table is deleted, you can run the **SHOW TABLES** command to check whether the table is successfully deleted. For details, see :ref:`SHOW TABLES `. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/index.rst new file mode 100644 index 0000000..731323d --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1407.html + +.. _mrs_01_1407: + +CarbonData Table Management +=========================== + +- :ref:`About CarbonData Table ` +- :ref:`Creating a CarbonData Table ` +- :ref:`Deleting a CarbonData Table ` +- :ref:`Modify the CarbonData Table ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + about_carbondata_table + creating_a_carbondata_table + deleting_a_carbondata_table + modify_the_carbondata_table diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/modify_the_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/modify_the_carbondata_table.rst new file mode 100644 index 0000000..7dd8ee5 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/carbondata_table_management/modify_the_carbondata_table.rst @@ -0,0 +1,41 @@ +:original_name: mrs_01_1411.html + +.. _mrs_01_1411: + +Modify the CarbonData Table +=========================== + +**SET** and **UNSET** +--------------------- + +When the **SET** command is executed, the new properties overwrite the existing ones. + +- SORT SCOPE + + The following is an example of the **SET SORT SCOPE** command: + + **ALTER TABLE** *tablename* **SET TBLPROPERTIES('SORT_SCOPE'**\ =\ *'no_sort'*) + + After running the **UNSET SORT SCOPE** command, the default value **NO_SORT** is adopted. + + The following is an example of the **UNSET SORT SCOPE** command: + + **ALTER TABLE** *tablename* **UNSET TBLPROPERTIES('SORT_SCOPE'**) + +- SORT COLUMNS + + The following is an example of the **SET SORT COLUMNS** command: + + **ALTER TABLE** *tablename* **SET TBLPROPERTIES('SORT_COLUMNS'**\ =\ *'column1'*) + + After this command is executed, the new value of **SORT_COLUMNS** is used. Users can adjust the **SORT_COLUMNS** based on the query results, but the original data is not affected. The operation does not affect the query performance of the original data segments which are not sorted by new **SORT_COLUMNS**. + + The **UNSET** command is not supported, but the **SORT_COLUMNS** can be set to empty string instead of using the **UNSET** command. + + **ALTER TABLE** *tablename* **SET TBLPROPERTIES('SORT_COLUMNS'**\ =\ *''*) + + .. note:: + + - The later version will enhance custom compaction to resort the old segments. + - The value of **SORT_COLUMNS** cannot be modified in the streaming table. + - If the **inverted index** column is removed from **SORT_COLUMNS**, **inverted index** will not be created in this column. However, the old configuration of **INVERTED_INDEX** will be kept. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/index.rst new file mode 100644 index 0000000..bce412a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_1405.html + +.. _mrs_01_1405: + +CarbonData Operation Guide +========================== + +- :ref:`CarbonData Quick Start ` +- :ref:`CarbonData Table Management ` +- :ref:`CarbonData Table Data Management ` +- :ref:`CarbonData Data Migration ` +- :ref:`Migrating Data on CarbonData from Spark 1.5 to Spark2x ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + carbondata_quick_start + carbondata_table_management/index + carbondata_table_data_management/index + carbondata_data_migration + migrating_data_on_carbondata_from_spark_1.5_to_spark2x diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/migrating_data_on_carbondata_from_spark_1.5_to_spark2x.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/migrating_data_on_carbondata_from_spark_1.5_to_spark2x.rst new file mode 100644 index 0000000..db23645 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_operation_guide/migrating_data_on_carbondata_from_spark_1.5_to_spark2x.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_2301.html + +.. _mrs_01_2301: + +Migrating Data on CarbonData from Spark 1.5 to Spark2x +====================================================== + +Migration Solution Overview +--------------------------- + +This migration guides you to migrate the CarbonData table data of Spark 1.5 to that of Spark2x. + +.. note:: + + Before performing this operation, you need to stop the data import service of the CarbonData table in Spark 1.5 and migrate data to the CarbonData table of Spark2x at a time. After the migration is complete, use Spark2x to perform service operations. + +Migration roadmap: + +#. Use Spark 1.5 to migrate historical data to the intermediate table. +#. Use Spark2x to migrate data from the intermediate table to the target table and change the target table name to the original table name. +#. After the migration is complete, use Spark2x to operate data in the CarbonData table. + +Migration Solution and Commands +------------------------------- + +**Migrating Historical Data** + +#. Stop the CarbonData data import service, use spark-beeline of Spark 1.5 to view the ID and time of the latest segment in the CarbonData table, and record the segment ID. + + **show segments for table dbname.tablename;** + +#. .. _mrs_01_2301__li1092003116117: + + Run spark-beeline of Spark 1.5 as the user who has created the original CarbonData table to create an intermediate table in ORC or Parquet format. Then import the data in the original CarbonData table to the intermediate table. After the import is complete, the services of the CarbonData table can be restored. + + Create an ORC table. + + **CREATE TABLE dbname.mid_tablename_orc STORED AS ORC as select \* from dbname.tablename;** + + Create a Parquet table. + + **CREATE TABLE dbname.mid_tablename_parq STORED AS PARQUET as select \* from dbname.tablename;** + + In the preceding command, **dbname** indicates the database name and **tablename** indicates the name of the original CarbonData table. + +#. .. _mrs_01_2301__li192112311210: + + Run spark-beeline of Spark2x as the user who has created the original CarbonData table. Run the table creation statement of the old table to create a CarbonData table. + + .. note:: + + In the statement for creating a new table, the field sequence and type must be the same as those of the old table. In this way, the index column structure of the old table can be retained, which helps avoid errors caused by the use of **select \*** statement during data insertion. + + Run the spark-beeline command of Spark 1.5 to view the table creation statement of the old table: **SHOW CREATE TABLE dbname.tablename;** + + Create a CarbonData table named **dbname.new_tablename**. + +#. 
Run spark-beeline of Spark2x as the user who has created the original CarbonData table to load the intermediate table data in ORC (or PARQUET) format created in :ref:`2 ` to the new table created in :ref:`3 `. This step may take a long time (about 2 hours for 200 GB data). The following uses the ORC intermediate table as an example to describe the command for loading data: + + **insert into dbname.new_tablename select \*** + + **from dbname. mid_tablename_orc;** + +#. Run spark-beeline of Spark2x as the user who has created the original CarbonData table to query and verify the data in the new table. If the data is correct, change the name of the original CarbonData table and then change the name of the new CarbonData table to the name of the original one. + + **ALTER TABLE dbname.tablename RENAME TO dbname.old_tablename;** + + **ALTER TABLE dbname.new_tablename RENAME TO dbname.tablename;** + +#. Complete the migration. In this case, you can use Spark2x to query the new table and rebuild the secondary index. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/configurations_for_performance_tuning.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/configurations_for_performance_tuning.rst new file mode 100644 index 0000000..057ae3a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/configurations_for_performance_tuning.rst @@ -0,0 +1,136 @@ +:original_name: mrs_01_1421.html + +.. _mrs_01_1421: + +Configurations for Performance Tuning +===================================== + +Scenario +-------- + +This section describes the configurations that can improve CarbonData performance. + +Procedure +--------- + +:ref:`Table 1 ` and :ref:`Table 2 ` describe the configurations about query of CarbonData. + +.. _mrs_01_1421__t21812cb0660b4dc8b7b03d48f5b8e23e: + +.. table:: **Table 1** Number of tasks started for the shuffle process + + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | spark.sql.shuffle.partitions | + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | spark-defaults.conf | + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data query | + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Number of tasks started for the shuffle process in Spark | + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | You are advised to set this parameter to one to two times as much as the executor cores. 
In an aggregation scenario, reducing the number from 200 to 32 can reduce the query time by two folds. | + +----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_1421__t8a9249ffd966446e9bfade15a686addd: + +.. table:: **Table 2** Number of executors and vCPUs, and memory size used for CarbonData data query + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | spark.executor.cores | + | | | + | | spark.executor.instances | + | | | + | | spark.executor.memory | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | spark-defaults.conf | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data query | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Number of executors and vCPUs, and memory size used for CarbonData data query | + 
+-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | In the bank scenario, configuring 4 vCPUs and 15 GB memory for each executor will achieve good performance. The two values do not mean the more the better. Configure the two values properly in case of limited resources. If each node has 32 vCPUs and 64 GB memory in the bank scenario, the memory is not sufficient. If each executor has 4 vCPUs and 12 GB memory, Garbage Collection may occur during query, time spent on query from increases from 3s to more than 15s. In this case, you need to increase the memory or reduce the number of vCPUs. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +:ref:`Table 3 `, :ref:`Table 4 `, and :ref:`Table 5 ` describe the configurations for CarbonData data loading. + +.. _mrs_01_1421__t237c47d9db1c411eaf39aa4d920be2f3: + +.. table:: **Table 3** Number of vCPUs used for data loading + + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | carbon.number.of.cores.while.loading | + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | carbon.properties | + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data loading | + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Number of vCPUs used for data processing during data loading in CarbonData | + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | If there are sufficient CPUs, you can increase the number of vCPUs to improve performance. 
For example, if the value of this parameter is changed from 2 to 4, the CSV reading performance can be doubled. | + +----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_1421__tcac192b5a3174a15b095684ff1ed0f80: + +.. table:: **Table 4** Whether to use Yarn local directories for multi-disk data loading + + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | carbon.use.local.dir | + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | carbon.properties | + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data loading | + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Whether to use Yarn local directories for multi-disk data loading | + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | If this parameter is set to **true**, CarbonData uses local Yarn directories for multi-table load disk load balance, improving data loading performance. | + +----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_1421__tc570296297a34147bc0c5800bff5ef56: + +.. 
table:: **Table 5** Whether to use multiple directories during loading + + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | carbon.use.multiple.temp.dir | + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | carbon.properties | + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data loading | + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Whether to use multiple temporary directories to store temporary sort files | + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | If this parameter is set to **true**, multiple temporary directories are used to store temporary sort files during data loading. This configuration improves data loading performance and prevents single points of failure (SPOFs) on disks. | + +----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +:ref:`Table 6 ` describes the configurations for CarbonData data loading and query. + +.. _mrs_01_1421__taf36a94822c9418ebec5d418fa2cce2e: + +.. 
table:: **Table 6** Number of vCPUs used for data loading and query + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | carbon.compaction.level.threshold | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration File | carbon.properties | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Function | Data loading and query | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tuning | Each CarbonData load will create one segment, if every load is small in size, it will generate many small files over a period of time impacting the query performance. Configuring this parameter will merge the small segments to one big segment which will sort the data and improve the performance. | + | | | + | | The compaction policy depends on the actual data size and available resources. For example, a bank loads data once a day and at night when no query is performed. If resources are sufficient, the compaction policy can be 6 or 5. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
table:: **Table 7** Whether to enable data pre-loading when the index cache server is used + + +----------------------+--------------------------------------------------------------------------------------------------------------------+ + | Parameter | carbon.indexserver.enable.prepriming | + +----------------------+--------------------------------------------------------------------------------------------------------------------+ + | Configuration File | carbon.properties | + +----------------------+--------------------------------------------------------------------------------------------------------------------+ + | Function | Data loading | + +----------------------+--------------------------------------------------------------------------------------------------------------------+ + | Scenario Description | Enabling data pre-loading during the use of the index cache server can improve the performance of the first query. | + +----------------------+--------------------------------------------------------------------------------------------------------------------+ + | Tuning | You can set this parameter to **true** to enable the pre-loading function. The default value is **false**. | + +----------------------+--------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/index.rst new file mode 100644 index 0000000..3eda8a2 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1417.html + +.. _mrs_01_1417: + +CarbonData Performance Tuning +============================= + +- :ref:`Tuning Guidelines ` +- :ref:`Suggestions for Creating CarbonData Tables ` +- :ref:`Configurations for Performance Tuning ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + tuning_guidelines + suggestions_for_creating_carbondata_tables + configurations_for_performance_tuning diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/suggestions_for_creating_carbondata_tables.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/suggestions_for_creating_carbondata_tables.rst new file mode 100644 index 0000000..65c7fac --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/suggestions_for_creating_carbondata_tables.rst @@ -0,0 +1,129 @@ +:original_name: mrs_01_1419.html + +.. _mrs_01_1419: + +Suggestions for Creating CarbonData Tables +========================================== + +Scenario +-------- + +This section provides suggestions based on more than 50 test cases to help you create CarbonData tables with higher query performance. + +.. table:: **Table 1** Columns in the CarbonData table + + =========== ============= =========== =========== + Column name Data type Cardinality Attribution + =========== ============= =========== =========== + msisdn String 30 million dimension + BEGIN_TIME bigint 10,000 dimension + host String 1 million dimension + dime_1 String 1,000 dimension + dime_2 String 500 dimension + dime_3 String 800 dimension + counter_1 numeric(20,0) NA measure + ... ... 
NA measure + counter_100 numeric(20,0) NA measure + =========== ============= =========== =========== + +Procedure +--------- + +- If the to-be-created table contains a column that is frequently used for filtering, for example, this column is used in more than 80% of filtering scenarios, + + implement optimization as follows: + + Place this column in the first column of **sort_columns**. + + For example, if **msisdn** is the most frequently used filter criterion in a query, it is placed in the first column. Run the following command to create a table. The query performance is good if **msisdn** is used as the filter condition. + + .. code-block:: + + create table carbondata_table( + msisdn String, + ... + )STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='msisdn'); + +- If the to-be-created table has multiple columns which are frequently used to filter the results, + + implement optimization as follows: + + Create an index for the columns. + + For example, if **msisdn**, **host**, and **dime_1** are frequently used columns, the **sort_columns** column sequence is "dime_1-> host-> msisdn..." based on cardinality. Run the following command to create a table. The following command can improve the filtering performance of **dime_1**, **host**, and **msisdn**. + + .. code-block:: + + create table carbondata_table( + dime_1 String, + host String, + msisdn String, + dime_2 String, + dime_3 String, + ... + )STORED AS carbondata + TBLPROPERTIES ('SORT_COLUMNS'='dime_1,host,msisdn'); + +- If the frequency of each column used for filtering is similar, + + implement optimization as follows: + + Sort **sort_columns** in ascending order of cardinality. + + Run the following command to create a table: + + .. code-block:: + + create table carbondata_table( + Dime_1 String, + BEGIN_TIME bigint, + HOST String, + MSISDN String, + ... + )STORED AS carbondata + TBLPROPERTIES ('SORT_COLUMNS'='dime_2,dime_3,dime_1,BEGIN_TIME,host,msisdn'); + +- Create tables with columns in ascending order of cardinality. Then create secondary indexes for columns with higher cardinality. The statement for creating an index is as follows: + + .. code-block:: + + create index carbondata_table_index_msisdn on table carbondata_table ( + MSISDN String) as 'carbondata' PROPERTIES ('table_blocksize'='128'); + create index carbondata_table_index_host on table carbondata_table ( + host String) as 'carbondata' PROPERTIES ('table_blocksize'='128'); + +- For measure columns that do not require high accuracy, the numeric(20,0) data type is not needed. You are advised to use the double data type instead of numeric(20,0) to enhance query performance. + + Performance analysis of a test case shows that the query execution time drops from 15 seconds to 3 seconds, improving performance by nearly five times. The command for creating a table is as follows: + + .. code-block:: + + create table carbondata_table( + Dime_1 String, + BEGIN_TIME bigint, + HOST String, + MSISDN String, + counter_1 double, + counter_2 double, + ... + counter_100 double + )STORED AS carbondata + ; + +- If the values of a column (**start_time**, for example) are incremental: + + For example, if data is loaded to CarbonData every day, **start_time** is incremental for each load. In this case, it is recommended that the **start_time** column be put at the end of **sort_columns**, because incremental values make efficient use of the min/max index. The command for creating a table is as follows: + + .. 
code-block:: + + create table carbondata_table( + Dime_1 String, + HOST String, + MSISDN String, + counter_1 double, + counter_2 double, + BEGIN_TIME bigint, + ... + counter_100 double + )STORED AS carbondata + TBLPROPERTIES ('SORT_COLUMNS'='dime_2,dime_3,dime_1,...,BEGIN_TIME'); diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/tuning_guidelines.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/tuning_guidelines.rst new file mode 100644 index 0000000..e98ccc7 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_performance_tuning/tuning_guidelines.rst @@ -0,0 +1,79 @@ +:original_name: mrs_01_1418.html + +.. _mrs_01_1418: + +Tuning Guidelines +================= + +Query Performance Tuning +------------------------ + +There are various parameters that can be tuned to improve the query performance in CarbonData. Most of the parameters focus on increasing the parallelism in processing and optimizing system resource usage. + +- Spark executor count: Executors are basic entities of parallelism in Spark. Raising the number of executors can increase the amount of parallelism in the cluster. For details about how to configure the number of executors, see the Spark documentation. +- Executor cores: The number of executor cores controls the number of concurrent tasks that an executor can run. Increasing the number of executor cores adds more concurrent processing tasks and improves performance. +- HDFS block size: CarbonData assigns query tasks by allocating different blocks to different executors for processing. The HDFS block is the partition unit. CarbonData maintains a global block-level index in the Spark driver, which helps reduce the number of blocks that need to be scanned for a query. A higher block size means higher I/O efficiency but lower global index efficiency. Conversely, a lower block size means lower I/O efficiency, higher global index efficiency, and greater memory consumption. +- Number of scanner threads: Scanner threads control the number of parallel data blocks that are processed by each task. By increasing the number of scanner threads, you can increase the number of data blocks that are processed in parallel to improve performance. The **carbon.number.of.cores** parameter in the **carbon.properties** file is used to configure the number of scanner threads. For example, **carbon.number.of.cores = 4**. +- B-Tree caching: The cache memory can be optimized using the B-Tree least recently used (LRU) caching. In the driver, the B-Tree LRU caching configuration helps free up the cache by releasing table segments which are not accessed or not used. Similarly, in the executor, the B-Tree LRU caching configuration will help release table blocks that are not accessed or used. For details, see the description of **carbon.max.driver.lru.cache.size** and **carbon.max.executor.lru.cache.size** in :ref:`Table 2 `. + +CarbonData Query Process +------------------------ + +When CarbonData receives a query task for a table, for example, table A, the index data of table A is loaded into memory for the query. When CarbonData receives another query task for table A, the system does not need to load the index data of table A again. + +When a query is performed in CarbonData, the query task is divided into several scan tasks, that is, the task is split based on HDFS blocks. Scan tasks are executed by executors on the cluster. 
Tasks can run in parallel, partially parallel, or in sequence, depending on the number of executors and the configured number of executor cores. + +Some parts of a query task can be processed at the individual task level, such as **select** and **filter**. Other parts of a query task can be only partially processed at the individual task level, such as **group-by**, **count**, and **distinct count**. + +Some operations cannot be performed at the task level, such as **Having Clause** (filter after grouping) and **sort**. Operations which cannot be performed at the task level or can be only performed partially at the task level require data (partial results) transmission across executors on the cluster. The transmission operation is called shuffle. + +The more tasks there are, the more data needs to be shuffled, which affects query performance. + +The number of tasks depends on the number of HDFS blocks, and the number of blocks depends on the size of each block. You are advised to configure a proper HDFS block size to achieve a balance among increased parallelism, the amount of data to be shuffled, and the size of aggregate tables. + +Relationship Between Splits and Executors +----------------------------------------- + +If the number of splits is less than or equal to the executor count multiplied by the executor core count, the tasks are run in parallel. Otherwise, some tasks can start only after other tasks are complete. Therefore, ensure that the executor count multiplied by the executor core count is greater than or equal to the number of splits. In addition, make sure that there are sufficient splits so that a query task can be divided into sufficient subtasks to ensure concurrency. + +Configuring Scanner Threads +--------------------------- + +The scanner threads property decides the number of data blocks to be processed. If there are too many data blocks, a large number of small data blocks will be generated, affecting performance. If there are few data blocks, the parallelism is poor and the performance is affected. Therefore, when determining the number of scanner threads, you are advised to consider the average data size within a partition and select a value that keeps data blocks from becoming too small. Based on experience, you are advised to divide a single block size (unit: MB) by 250 and use the result as the number of scanner threads. + +The number of actual available vCPUs is an important factor to consider when you want to increase the parallelism. The number of vCPUs that conduct parallel computation must not exceed 75% to 80% of the actual vCPUs. + +The number of vCPUs used is approximately equal to: + +Number of parallel tasks x Number of scanner threads, where the number of parallel tasks is the smaller of the number of splits and the executor count multiplied by the executor core count. + +Data Loading Performance Tuning +------------------------------- + +Tuning of data loading performance is different from that of query performance. Similar to query performance, data loading performance depends on the amount of parallelism that can be achieved. In the case of data loading, the number of worker threads decides the unit of parallelism. Therefore, more executors and more executor cores mean better data loading performance. + +To achieve better performance, you can configure the following parameters in HDFS. + +.. 
table:: **Table 1** HDFS configuration + + ===================================== ================= + Parameter Recommended Value + ===================================== ================= + dfs.datanode.drop.cache.behind.reads false + dfs.datanode.drop.cache.behind.writes false + dfs.datanode.sync.behind.writes true + ===================================== ================= + +Compression Tuning +------------------ + +CarbonData uses a few lightweight compression and heavyweight compression algorithms to compress data. Although these algorithms can process any type of data, the compression performance is better if the data is ordered with similar values being together. + +During data loading, data is sorted based on the order of columns in the table to achieve good compression performance. + +Since CarbonData sorts data in the order of columns defined in the table, the order of columns plays an important role in the effectiveness of compression. If the low cardinality dimension is on the left, the range of data partitions after sorting is small and the compression efficiency is high. If a high cardinality dimension is on the left, a range of data partitions obtained after sorting is relatively large, and compression efficiency is relatively low. + +Memory Tuning +------------- + +CarbonData provides a mechanism for memory tuning where data loading depends on the columns needed in the query. Whenever a query command is received, columns required by the query are fetched and data is loaded for those columns in memory. During this operation, if the memory threshold is reached, the least used loaded files are deleted to release memory space for columns required by the query. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/api.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/api.rst new file mode 100644 index 0000000..c49c912 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/api.rst @@ -0,0 +1,136 @@ +:original_name: mrs_01_1450.html + +.. _mrs_01_1450: + +API +=== + +This section describes the APIs and usage methods of Segment. All methods are in the org.apache.spark.util.CarbonSegmentUtil class. + +The following methods have been abandoned: + +.. code-block:: + + /** + * Returns the valid segments for the query based on the filter condition + * present in carbonScanRdd. + * + * @param carbonScanRdd + * @return Array of valid segments + */ + @deprecated def getFilteredSegments(carbonScanRdd: CarbonScanRDD[InternalRow]): Array[String]; + +Usage Method +------------ + +Use the following methods to obtain CarbonScanRDD from the query statement: + +.. code-block:: + + val df=carbon.sql("select * from table where age='12'") + val myscan=df.queryExecution.sparkPlan.collect { + case scan: CarbonDataSourceScan if scan.rdd.isInstanceOf[CarbonScanRDD[InternalRow]] => scan.rdd + case scan: RowDataSourceScanExec if scan.rdd.isInstanceOf[CarbonScanRDD[InternalRow]] => scan.rdd + }.head + val carbonrdd=myscan.asInstanceOf[CarbonScanRDD[InternalRow]] + +Example: + +.. code-block:: + + CarbonSegmentUtil.getFilteredSegments(carbonrdd) + +The filtered segment can be obtained by importing SQL statements. + +.. 
code-block:: + + /** + * Returns an array of valid segment numbers based on the filter condition provided in the sql + * NOTE: This API is supported only for SELECT Sql (insert into,ctas,.., is not supported) + * + * @param sql + * @param sparkSession + * @return Array of valid segments + * @throws UnsupportedOperationException because Get Filter Segments API supports if and only + * if only one carbon main table is present in query. + */ + def getFilteredSegments(sql: String, sparkSession: SparkSession): Array[String]; + +Example: + +.. code-block:: + + CarbonSegmentUtil.getFilteredSegments("select * from table where age='12'", sparkSession) + +Import the database name and table name to obtain the list of segments to be merged. The obtained segments can be used as parameters of the getMergedLoadName function. + +.. code-block:: + + /** + * Identifies all segments which can be merged with MAJOR compaction type. + * NOTE: This result can be passed to getMergedLoadName API to get the merged load name. + * + * @param sparkSession + * @param tableName + * @param dbName + * @return list of LoadMetadataDetails + */ + def identifySegmentsToBeMerged(sparkSession: SparkSession, + tableName: String, + dbName: String) : util.List[LoadMetadataDetails]; + +Example: + +.. code-block:: + + CarbonSegmentUtil.identifySegmentsToBeMerged(sparkSession, "table_test","default") + +Import the database name, table name, and obtain all segments which can be merged with CUSTOM compaction type. The obtained segments can be transferred as the parameter of the getMergedLoadName function. + +.. code-block:: + + /** + * Identifies all segments which can be merged with CUSTOM compaction type. + * NOTE: This result can be passed to getMergedLoadName API to get the merged load name. + * + * @param sparkSession + * @param tableName + * @param dbName + * @param customSegments + * @return list of LoadMetadataDetails + * @throws UnsupportedOperationException if customSegments is null or empty. + * @throws MalformedCarbonCommandException if segment does not exist or is not valid + */ + def identifySegmentsToBeMergedCustom(sparkSession: SparkSession, + tableName: String, + dbName: String, + customSegments: util.List[String]): util.List[LoadMetadataDetails]; + +Example: + +.. code-block:: + + val customSegments = new util.ArrayList[String]() + customSegments.add("1") + customSegments.add("2") + CarbonSegmentUtil.identifySegmentsToBeMergedCustom(sparkSession, "table_test","default", customSegments) + +If a segment list is specified, the merged load name is returned. + +.. code-block:: + + /** + * Returns the Merged Load Name for given list of segments + * + * @param list of segments + * @return Merged Load Name + * @throws UnsupportedOperationException if list of segments is less than 1 + */ + def getMergedLoadName(list: util.List[LoadMetadataDetails]): String; + +Example: + +.. 
code-block:: + + val carbonTable = CarbonEnv.getCarbonTable(Option(databaseName), tableName)(sparkSession) + val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath) CarbonSegmentUtil.getMergedLoadName(loadMetadataDetails.toList.asJava) diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/add_columns.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/add_columns.rst new file mode 100644 index 0000000..9ef3450 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/add_columns.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_1431.html + +.. _mrs_01_1431: + +ADD COLUMNS +=========== + +Function +-------- + +This command is used to add a column to an existing table. + +Syntax +------ + +**ALTER TABLE** *[db_name.]table_name* **ADD COLUMNS** *(col_name data_type,...)* **TBLPROPERTIES**\ *(''COLUMNPROPERTIES.columnName.shared_column'='sharedFolder.sharedColumnName,...', 'DEFAULT.VALUE.COLUMN_NAME'='default_value')*; + +Parameter Description +--------------------- + +.. table:: **Table 1** ADD COLUMNS parameters + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===================================================================================================================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is selected. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table_name | Table name. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | col_name data_type | Name of a comma-separated column with a data type. It consists of letters, digits, and underscores (_). | + | | | + | | .. note:: | + | | | + | | When creating a CarbonData table, do not name columns as tupleId, PositionId, and PositionReference because they will be used in UPDATE, DELETE, and secondary index commands. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +- Only **shared_column** and **default_value** are read. If any other property name is specified, no error will be thrown and the property will be ignored. +- If no default value is specified, the default value of the new column is considered null. +- If filter is applied to the column, new columns will not be added during sort. New columns may affect query performance. 
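+
+A minimal sketch of how the default value behaves for rows that already exist; the table and column names reuse the examples below, and the returned values follow the precautions above rather than verified output:
+
+.. code-block::
+
+   ALTER TABLE carbon ADD COLUMNS (a1 INT) TBLPROPERTIES('DEFAULT.VALUE.a1'='10');
+   -- Existing rows should return 10 for a1; without DEFAULT.VALUE, they would return NULL.
+   SELECT a1 FROM carbon LIMIT 10;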
+ +Examples +-------- + +- **ALTER TABLE** *carbon* **ADD COLUMNS** *(a1 INT, b1 STRING)*; +- **ALTER TABLE** *carbon* **ADD COLUMNS** *(a1 INT, b1 STRING)* **TBLPROPERTIES**\ *('COLUMNPROPERTIES.b1.shared_column'='sharedFolder.b1')*; +- ALTER TABLE *carbon* **ADD COLUMNS** *(a1 INT, b1 STRING)* **TBLPROPERTIES**\ *('DEFAULT.VALUE.a1'='10')*; + +System Response +--------------- + +The newly added column can be displayed by running the **DESCRIBE** command. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/alter_table_compaction.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/alter_table_compaction.rst new file mode 100644 index 0000000..765d787 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/alter_table_compaction.rst @@ -0,0 +1,85 @@ +:original_name: mrs_01_1429.html + +.. _mrs_01_1429: + +ALTER TABLE COMPACTION +====================== + +Function +-------- + +The **ALTER TABLE COMPACTION** command is used to merge a specified number of segments into a single segment. This improves the query performance of a table. + +Syntax +------ + +**ALTER TABLE**\ *[db_name.]table_name COMPACT 'MINOR/MAJOR/SEGMENT_INDEX';* + +**ALTER TABLE**\ *[db_name.]table_name COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (id1, id2, ...);* + +Parameter Description +--------------------- + +.. table:: **Table 1** ALTER TABLE COMPACTION parameters + + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===============+===================================================================================================================================================================================================================================================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is selected. | + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table_name | Table name. | + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MINOR | Minor compaction. For details, see :ref:`Combining Segments `. | + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MAJOR | Major compaction. For details, see :ref:`Combining Segments `. 
| + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SEGMENT_INDEX | This configuration enables you to merge all the CarbonData index files (**.carbonindex**) inside a segment to a single CarbonData index merge file (**.carbonindexmerge**). This enhances the first query performance. For more information, see :ref:`Table 1 `. | + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | CUSTOM | Custom compaction. For details, see :ref:`Combining Segments `. | + +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +N/A + +Examples +-------- + +**ALTER TABLE ProductDatabase COMPACT 'MINOR';** + +**ALTER TABLE ProductDatabase COMPACT 'MAJOR';** + +**ALTER TABLE ProductDatabase COMPACT 'SEGMENT_INDEX';** + +**ALTER TABLE ProductDatabase COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (0, 1);** + +System Response +--------------- + +**ALTER TABLE COMPACTION** does not show the response of the compaction because it is run in the background. + +If you want to view the response of minor and major compactions, you can check the logs or run the **SHOW SEGMENTS** command. + +Example: + +.. code-block:: + + +------+------------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | ID | Status | Load Start Time | Load Time Taken | Partition | Data Size | Index Size | File Format | + +------+------------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | 3 | Success | 2020-09-28 22:53:26.336 | 3.726S | {} | 6.47KB | 3.30KB | columnar_v3 | + | 2 | Success | 2020-09-28 22:53:01.702 | 6.688S | {} | 6.47KB | 3.30KB | columnar_v3 | + | 1 | Compacted | 2020-09-28 22:51:15.242 | 5.82S | {} | 6.50KB | 3.43KB | columnar_v3 | + | 0.1 | Success | 2020-10-30 20:49:24.561 | 16.66S | {} | 12.87KB | 6.91KB | columnar_v3 | + | 0 | Compacted | 2020-09-28 22:51:02.6 | 6.819S | {} | 6.50KB | 3.43KB | columnar_v3 | + +------+------------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + +In the preceding information: + +- **Compacted** indicates that data has been compacted. +- **0.1** indicates the compacting result of segment 0 and segment 1. + +The compact operation does not incur any change to other operations. + +Compacted segments, such as segment 0 and segment 1, become useless. To save space, before you perform other operations, run the **CLEAN FILES** command to delete compacted segments. For more information about the **CLEAN FILES** command, see :ref:`CLEAN FILES `. 
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/change_data_type.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/change_data_type.rst new file mode 100644 index 0000000..f7e4544 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/change_data_type.rst @@ -0,0 +1,61 @@ +:original_name: mrs_01_1433.html + +.. _mrs_01_1433: + +CHANGE DATA TYPE +================ + +Function +-------- + +This command is used to change the data type from INT to BIGINT or decimal precision from lower to higher. + +Syntax +------ + +**ALTER TABLE** *[db_name.]table_name* **CHANGE** *col_name col_name changed_column_type*; + +Parameter Description +--------------------- + +.. table:: **Table 1** CHANGE DATA TYPE parameters + + +---------------------+------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=====================+================================================================================================+ + | db_name | Name of the database. If this parameter is left unspecified, the current database is selected. | + +---------------------+------------------------------------------------------------------------------------------------+ + | table_name | Name of the table. | + +---------------------+------------------------------------------------------------------------------------------------+ + | col_name | Name of columns in a table. Column names contain letters, digits, and underscores (_). | + +---------------------+------------------------------------------------------------------------------------------------+ + | changed_column_type | The change in the data type. | + +---------------------+------------------------------------------------------------------------------------------------+ + +Usage Guidelines +---------------- + +- Change of decimal data type from lower precision to higher precision will only be supported for cases where there is no data loss. + + Example: + + - **Invalid scenario** - Change of decimal precision from (10,2) to (10,5) is not valid as in this case only scale is increased but total number of digits remain the same. + - **Valid scenario** - Change of decimal precision from (10,2) to (12,3) is valid as the total number of digits are increased by 2 but scale is increased only by 1 which will not lead to any data loss. + +- The allowed range is 38,38 (precision, scale) and is a valid upper case scenario which is not resulting in data loss. + +Examples +-------- + +- Changing data type of column a1 from INT to BIGINT. + + **ALTER TABLE** *test_db.carbon* **CHANGE** *a1 a1 BIGINT*; + +- Changing decimal precision of column a1 from 10 to 18. + + **ALTER TABLE** *test_db.carbon* **CHANGE** *a1 a1 DECIMAL(18,2)*; + +System Response +--------------- + +By running DESCRIBE command, the changed data type for the modified column is displayed. 
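+
+A minimal sketch combining the examples above with a verification step; the table and column names reuse the examples, and the DESCRIBE output wording is indicative only:
+
+.. code-block::
+
+   ALTER TABLE test_db.carbon CHANGE a1 a1 DECIMAL(18,2);
+   -- The modified column should now be reported with the new type, for example decimal(18,2).
+   DESCRIBE test_db.carbon;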
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table.rst new file mode 100644 index 0000000..2f40605 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table.rst @@ -0,0 +1,152 @@ +:original_name: mrs_01_1425.html + +.. _mrs_01_1425: + +CREATE TABLE +============ + +Function +-------- + +This command is used to create a CarbonData table by specifying the list of fields along with the table properties. + +Syntax +------ + +**CREATE TABLE** *[IF NOT EXISTS] [db_name.]table_name* + +*[(col_name data_type, ...)]* + +**STORED AS** *carbondata* + +*[TBLPROPERTIES (property_name=property_value, ...)];* + +Additional attributes of all tables are defined in **TBLPROPERTIES**. + +Parameter Description +--------------------- + +.. table:: **Table 1** CREATE TABLE parameters + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+==============================================================================================================================================================================================+ + | db_name | Database name that contains letters, digits, and underscores (_). | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | col_name data_type | List with data types separated by commas (,). The column name contains letters, digits, and underscores (_). | + | | | + | | .. note:: | + | | | + | | When creating a CarbonData table, do not use tupleId, PositionId, and PositionReference as column names because columns with these names are internally used by secondary index commands. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table_name | Table name of a database that contains letters, digits, and underscores (_). | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | STORED AS | The **carbondata** parameter defines and creates a CarbonData table. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | TBLPROPERTIES | List of CarbonData table properties. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_1425__s4539dafd333c46ae855caaa175609f60: + +Precautions +----------- + +Table attributes are used as follows: + +- .. 
_mrs_01_1425__l053c6fa1a366488ea6410cb4bb4fc5d1: + + Block size + + The block size of a data file can be defined for a single table using **TBLPROPERTIES**. The larger one between the actual size of the data file and the defined block size is selected as the actual block size of the data file in HDFS. The unit is MB. The default value is 1024 MB. The value ranges from 1 MB to 2048 MB. If the value is beyond the range, the system reports an error. + + Once the block size reaches the configured value, the write program starts a new block of CarbonData data. Data is written in multiples of the page size (32,000 records). Therefore, the boundary is not strict at the byte level. If the new page crosses the boundary of the configured block, the page is written to the new block instead of the current block. + + *TBLPROPERTIES('table_blocksize'='128')* + + .. note:: + +      - If a small block size is configured in the CarbonData table while the size of the data file generated by the loaded data is large, the block size displayed in HDFS is different from the configured value. This is because when data is written to a local block file for the first time, even though the size of the to-be-written data is larger than the configured block size, the data is still written into that block. Therefore, the actual block size in HDFS is the larger value between the size of the data to be written and the configured block size. +      - If the number of blocks (**block.num**) is less than the parallelism, the blocks are split so that the resulting number of blocks is greater than the parallelism and all cores can be used. This optimization is called block distribution. + +- **SORT_SCOPE** specifies the sort scope during table creation. The following sort scopes are supported: + +   - **GLOBAL_SORT**: Data is sorted globally. It improves query performance, especially for point queries. *TBLPROPERTIES('SORT_SCOPE'='GLOBAL_SORT')* +   - **LOCAL_SORT**: Data is sorted locally (task-level sorting). +   - **NO_SORT**: The default sort scope. Data is loaded in an unsorted manner, which greatly improves loading performance. + +- SORT_COLUMNS + + This table property specifies the order of sort columns. + + *TBLPROPERTIES('SORT_COLUMNS'='column1, column3')* + + .. note:: + +      - If this attribute is not specified, no columns are sorted by default. +      - If this property is specified with an empty argument, for example *('SORT_COLUMNS'='')*, the table is loaded without sorting. +      - **SORT_COLUMNS** supports the string, date, timestamp, short, int, long, byte, and boolean data types. + +- RANGE_COLUMN + + This property is used to specify a column to partition the input data by range. Only one column can be configured. During data import, you can use **global_sort_partitions** or **scale_factor** to avoid generating small files. + + *TBLPROPERTIES('RANGE_COLUMN'='column1')* + +- LONG_STRING_COLUMNS + + The length of a common string cannot exceed 32,000 characters. To store a string of more than 32,000 characters, set **LONG_STRING_COLUMNS** to the target columns. + + *TBLPROPERTIES('LONG_STRING_COLUMNS'='column1, column3')* + + .. note:: + +      **LONG_STRING_COLUMNS** can be set only for columns of the STRING, CHAR, or VARCHAR type. + +Scenarios +--------- + +Creating a Table by Specifying Columns + +The **CREATE TABLE** command is the same as that of Hive DDL. The additional configurations of CarbonData are provided as table properties; a combined sketch of these properties is shown below, followed by the generic syntax.
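+ +The following is a combined sketch of the properties described above; the database, table, and column names are illustrative only. + +.. code-block:: + +   -- Assumed example: sort on id and name, range-partition the input by event_time, +   -- and allow payload to exceed 32,000 characters. +   CREATE TABLE IF NOT EXISTS demo_db.events ( +     id INT, +     name STRING, +     event_time TIMESTAMP, +     payload STRING) +   STORED AS carbondata +   TBLPROPERTIES ( +     'table_blocksize'='256', +     'SORT_SCOPE'='LOCAL_SORT', +     'SORT_COLUMNS'='id, name', +     'RANGE_COLUMN'='event_time', +     'LONG_STRING_COLUMNS'='payload');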
+ +**CREATE TABLE** *[IF NOT EXISTS] [db_name.]table_name* + +*[(col_name data_type , ...)]* + +STORED AS *carbondata* + +*[TBLPROPERTIES (property_name=property_value, ...)];* + +Examples +-------- + +**CREATE TABLE** *IF NOT EXISTS productdb.productSalesTable (* + +*productNumber Int,* + +*productName String,* + +*storeCity String,* + +*storeProvince String,* + +*productCategory String,* + +*productBatch String,* + +*saleQuantity Int,* + +*revenue Int)* + +*STORED AS carbondata* + +*TBLPROPERTIES (* + +*'table_blocksize'='128',* + +*'SORT_COLUMNS'='productBatch, productName')* + +System Response +--------------- + +A table will be created and the success message will be logged in system logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table_as_select.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table_as_select.rst new file mode 100644 index 0000000..d063aac --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/create_table_as_select.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1426.html + +.. _mrs_01_1426: + +CREATE TABLE As SELECT +====================== + +Function +-------- + +This command is used to create a CarbonData table by specifying the list of fields along with the table properties. + +Syntax +------ + +**CREATE TABLE**\ * [IF NOT EXISTS] [db_name.]table_name* **STORED AS carbondata** *[TBLPROPERTIES (key1=val1, key2=val2, ...)] AS select_statement;* + +Parameter Description +--------------------- + +.. table:: **Table 1** CREATE TABLE parameters + + +---------------+----------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===============+============================================================================================================================+ + | db_name | Database name that contains letters, digits, and underscores (_). | + +---------------+----------------------------------------------------------------------------------------------------------------------------+ + | table_name | Table name of a database that contains letters, digits, and underscores (_). | + +---------------+----------------------------------------------------------------------------------------------------------------------------+ + | STORED AS | Used to store data in CarbonData format. | + +---------------+----------------------------------------------------------------------------------------------------------------------------+ + | TBLPROPERTIES | List of CarbonData table properties. For details, see :ref:`Precautions `. | + +---------------+----------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +N/A + +Examples +-------- + +**CREATE TABLE** ctas_select_parquet **STORED AS** carbondata as select \* from parquet_ctas_test; + +System Response +--------------- + +This example will create a Carbon table from any Parquet table and load all the records from the Parquet table. 
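+ +A CTAS statement can also carry CarbonData table properties in the same statement. The following sketch assumes a hypothetical Parquet source table named **orders_parquet** with an integer column **order_id**. + +.. code-block:: + +   -- Create a CarbonData table from a Parquet source and define its sort columns in one step. +   CREATE TABLE IF NOT EXISTS orders_carbon +   STORED AS carbondata +   TBLPROPERTIES ('SORT_COLUMNS'='order_id') +   AS SELECT * FROM orders_parquet;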
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_columns.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_columns.rst new file mode 100644 index 0000000..a60ae38 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_columns.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_1432.html + +.. _mrs_01_1432: + +DROP COLUMNS +============ + +Function +-------- + +This command is used to delete one or more columns from a table. + +Syntax +------ + +**ALTER TABLE** *[db_name.]table_name* **DROP COLUMNS** *(col_name, ...)*; + +Parameter Description +--------------------- + +.. table:: **Table 1** DROP COLUMNS parameters + + +------------+-------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+===================================================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is selected. | + +------------+-------------------------------------------------------------------------------------------------------------------+ + | table_name | Table name. | + +------------+-------------------------------------------------------------------------------------------------------------------+ + | col_name | Name of a column in a table. Multiple columns are supported. It consists of letters, digits, and underscores (_). | + +------------+-------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +After a column is deleted, at least one key column must exist in the schema. Otherwise, an error message is displayed, and the column fails to be deleted. + +Examples +-------- + +Assume that the table contains four columns named a1, b1, c1, and d1. + +- Delete a column: + + **ALTER TABLE** *carbon* **DROP COLUMNS** *(b1)*; + + **ALTER TABLE** *test_db.carbon* **DROP COLUMNS** *(b1)*; + +- Delete multiple columns: + + **ALTER TABLE** *carbon* **DROP COLUMNS** *(b1,c1)*; + + **ALTER TABLE** *test_db.carbon* **DROP COLUMNS** *(b1,c1)*; + +System Response +--------------- + +If you run the **DESCRIBE** command, the deleted columns will not be displayed. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_table.rst new file mode 100644 index 0000000..8781e5a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/drop_table.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_1427.html + +.. _mrs_01_1427: + +DROP TABLE +========== + +Function +-------- + +This command is used to delete an existing table. + +Syntax +------ + +**DROP TABLE** *[IF EXISTS] [db_name.]table_name;* + +Parameter Description +--------------------- + +.. table:: **Table 1** DROP TABLE parameters + + +------------+--------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+======================================================================================+ + | db_name | Database name. 
If this parameter is not specified, the current database is selected. | + +------------+--------------------------------------------------------------------------------------+ + | table_name | Name of the table to be deleted | + +------------+--------------------------------------------------------------------------------------+ + +Precautions +----------- + +In this command, **IF EXISTS** and **db_name** are optional. + +Example +------- + +**DROP TABLE IF EXISTS productDatabase.productSalesTable;** + +System Response +--------------- + +The table will be deleted. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/index.rst new file mode 100644 index 0000000..8ecdb90 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/index.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_1424.html + +.. _mrs_01_1424: + +DDL +=== + +- :ref:`CREATE TABLE ` +- :ref:`CREATE TABLE As SELECT ` +- :ref:`DROP TABLE ` +- :ref:`SHOW TABLES ` +- :ref:`ALTER TABLE COMPACTION ` +- :ref:`TABLE RENAME ` +- :ref:`ADD COLUMNS ` +- :ref:`DROP COLUMNS ` +- :ref:`CHANGE DATA TYPE ` +- :ref:`REFRESH TABLE ` +- :ref:`REGISTER INDEX TABLE ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + create_table + create_table_as_select + drop_table + show_tables + alter_table_compaction + table_rename + add_columns + drop_columns + change_data_type + refresh_table + register_index_table diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/refresh_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/refresh_table.rst new file mode 100644 index 0000000..e584df3 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/refresh_table.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1434.html + +.. _mrs_01_1434: + +REFRESH TABLE +============= + +Function +-------- + +This command is used to register Carbon table to Hive meta store catalogue from exisiting Carbon table data. + +Syntax +------ + +**REFRESH TABLE** *db_name.table_name*; + +Parameter Description +--------------------- + +.. table:: **Table 1** REFRESH TABLE parameters + + +------------+------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+================================================================================================+ + | db_name | Name of the database. If this parameter is left unspecified, the current database is selected. | + +------------+------------------------------------------------------------------------------------------------+ + | table_name | Name of the table. | + +------------+------------------------------------------------------------------------------------------------+ + +Usage Guidelines +---------------- + +- The new database name and the old database name should be same. +- Before executing this command the old table schema and data should be copied into the new database location. +- If the table is aggregate table, then all the aggregate tables should be copied to the new database location. +- For old store, the time zone of the source and destination cluster should be same. 
+- If the old cluster used the Hive metastore to store the schema, REFRESH will not work because the schema file does not exist in the file system. + +Examples +-------- + +**REFRESH TABLE** *dbcarbon*.\ *productSalesTable*; + +System Response +--------------- + +By running this command, the Carbon table is registered with the Hive metastore catalog based on the existing Carbon table data. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/register_index_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/register_index_table.rst new file mode 100644 index 0000000..580c41a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/register_index_table.rst @@ -0,0 +1,68 @@ +:original_name: mrs_01_1435.html + +.. _mrs_01_1435: + +REGISTER INDEX TABLE +==================== + +Function +-------- + +This command is used to register an index table with the primary table. + +Syntax +------ + +**REGISTER INDEX TABLE** *indextable_name* ON *db_name.maintable_name*; + +Parameter Description +--------------------- + +.. table:: **Table 1** REGISTER INDEX TABLE parameters + + +-----------------+--------------------------------------------------------------------------------------+ + | Parameter | Description | + +=================+======================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is selected. | + +-----------------+--------------------------------------------------------------------------------------+ + | indextable_name | Index table name. | + +-----------------+--------------------------------------------------------------------------------------+ + | maintable_name | Primary table name. | + +-----------------+--------------------------------------------------------------------------------------+ + +Precautions +----------- + +Before running this command, run **REFRESH TABLE** to register the primary table and secondary index table with the Hive metastore. + +Examples +-------- + +**create database** **productdb;** + +**use productdb;** + +**CREATE TABLE productSalesTable(a int,b string,c string) stored as carbondata;** + +**create index productNameIndexTable on table productSalesTable(c) as 'carbondata';** + +**insert into table productSalesTable select 1,'a','aaa';** + +**create database productdb2;** + +Run the **hdfs** command to copy **productSalesTable** and **productNameIndexTable** in the **productdb** database to the **productdb2** database. + +**refresh table productdb2.productSalesTable;** + +**refresh table productdb2.productNameIndexTable;** + +**explain select \* from productdb2.productSalesTable where c = 'aaa';** // The query command does not use an index table. + +**REGISTER INDEX TABLE productNameIndexTable ON productdb2.productSalesTable;** + +**explain select \* from productdb2.productSalesTable where c = 'aaa';** // The query command uses an index table. + +System Response +--------------- + +By running this command, the index table will be registered with the primary table.
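+ +To confirm the registration, you can list the index tables on the primary table and re-check the query plan; this sketch simply continues the example above (the **SHOW INDEXES** command is described in the SHOW SECONDARY INDEXES section). + +.. code-block:: + +   -- After REGISTER INDEX TABLE, the index table should appear in the listing. +   SHOW INDEXES ON productdb2.productSalesTable; + +   -- The query plan should now use the index table. +   explain select * from productdb2.productSalesTable where c = 'aaa';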
diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/show_tables.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/show_tables.rst new file mode 100644 index 0000000..1764b84 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/show_tables.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_1428.html + +.. _mrs_01_1428: + +SHOW TABLES +=========== + +Function +-------- + +**SHOW TABLES** command is used to list all tables in the current or a specific database. + +Syntax +------ + +**SHOW TABLES** *[IN db\_name];* + +Parameter Description +--------------------- + +.. table:: **Table 1** SHOW TABLE parameters + + +------------+---------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+===============================================================================================================+ + | IN db_name | Name of the database. This parameter is required only when tables of this specific database are to be listed. | + +------------+---------------------------------------------------------------------------------------------------------------+ + +Usage Guidelines +---------------- + +IN db_Name is optional. + +Examples +-------- + +**SHOW TABLES IN ProductDatabase;** + +System Response +--------------- + +All tables are listed. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/table_rename.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/table_rename.rst new file mode 100644 index 0000000..104990b --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/ddl/table_rename.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_1430.html + +.. _mrs_01_1430: + +TABLE RENAME +============ + +Function +-------- + +This command is used to rename an existing table. + +Syntax +------ + +**ALTER TABLE** *[db_name.]table_name* **RENAME TO** *new_table_name*; + +Parameter Description +--------------------- + +.. table:: **Table 1** RENAME parameters + + +----------------+--------------------------------------------------------------------------------------+ + | Parameter | Description | + +================+======================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is selected. | + +----------------+--------------------------------------------------------------------------------------+ + | table_name | Current name of the existing table | + +----------------+--------------------------------------------------------------------------------------+ + | new_table_name | New name of the existing table | + +----------------+--------------------------------------------------------------------------------------+ + +Precautions +----------- + +- Parallel queries (using table names to obtain paths for reading CarbonData storage files) may fail during this operation. +- The secondary index table cannot be renamed. 
+ +Example +------- + +**ALTER TABLE** *carbon* **RENAME TO** *carbondata*; + +**ALTER TABLE** *test_db.carbon* **RENAME TO** *test_db.carbondata*; + +System Response +--------------- + +The new table name will be displayed in the CarbonData folder. You can run **SHOW TABLES** to view the new table name. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/clean_files.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/clean_files.rst new file mode 100644 index 0000000..b2b00f8 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/clean_files.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_1448.html + +.. _mrs_01_1448: + +CLEAN FILES +=========== + +Function +-------- + +After the **DELETE SEGMENT** command is executed, the deleted segments are marked as the **delete** state. After the segments are merged, the status of the original segments changes to **compacted**. The data files of these segments are not physically deleted. If you want to forcibly delete these files, run the **CLEAN FILES** command. + +However, running this command may result in a query command execution failure. + +Syntax +------ + +**CLEAN FILES FOR TABLE**\ * [db_name.]table_name* ; + +Parameter Description +--------------------- + +.. table:: **Table 1** CLEAN FILES FOR TABLE parameters + + +------------+----------------------------------------------------------------------------------+ + | Parameter | Description | + +============+==================================================================================+ + | db_name | Database name. It consists of letters, digits, and underscores (_). | + +------------+----------------------------------------------------------------------------------+ + | table_name | Name of the database table. It consists of letters, digits, and underscores (_). | + +------------+----------------------------------------------------------------------------------+ + +Precautions +----------- + +None + +Examples +-------- + +Add Carbon configuration parameters. + +.. code-block:: + + carbon.clean.file.force.allowed = true + +**create table carbon01(a int,b string,c string) stored as carbondata;** + +**insert into table carbon01 select 1,'a','aa';** + +**insert into table carbon01 select 2,'b','bb';** + +**delete from table carbon01 where segment.id in (0);** + +**show segments for table carbon01;** + +**CLEAN FILES FOR TABLE carbon01 options('force'='true');** + +**show segments for table carbon01;** + +In this example, all the segments marked as **deleted** and **compacted** are physically deleted. + +System Response +--------------- + +Success or failure will be recorded in the driver logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/create_secondary_index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/create_secondary_index.rst new file mode 100644 index 0000000..b3aa315 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/create_secondary_index.rst @@ -0,0 +1,60 @@ +:original_name: mrs_01_1445.html + +.. _mrs_01_1445: + +CREATE SECONDARY INDEX +====================== + +Function +-------- + +This command is used to create secondary indexes in the CarbonData tables. 
+ +Syntax +------ + +**CREATE INDEX** *index_name* + +**ON TABLE** *[db_name.]table_name (col_name1, col_name2)* + +**AS** *'carbondata*' + +**PROPERTIES** *('table_blocksize'='256')*; + +Parameter Description +--------------------- + +.. table:: **Table 1** CREATE SECONDARY INDEX parameters + + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=================+==========================================================================================================================+ + | index_name | Index table name. It consists of letters, digits, and special characters (_). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + | db_name | Database name. It consists of letters, digits, and special characters (_). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + | table_name | Name of the database table. It consists of letters, digits, and special characters (_). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + | col_name | Name of a column in a table. Multiple columns are supported. It consists of letters, digits, and special characters (_). | + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + | table_blocksize | Block size of a data file. For details, see :ref:`•Block Size `. | + +-----------------+--------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +**db_name** is optional. + +Examples +-------- + +**create table** **productdb.productSalesTable(id int,price int,productName string,city string) stored** **as** **carbondata;** + +**CREATE INDEX productNameIndexTable on table productdb.productSalesTable (productName,city) as 'carbondata' ;** + +In this example, a secondary table named **productdb.productNameIndexTable** is created and index information of the provided column is loaded. + +System Response +--------------- + +A secondary index table will be created. Index information related to the provided column will be loaded into the secondary index table. The success message will be recorded in system logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_records_from_carbon_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_records_from_carbon_table.rst new file mode 100644 index 0000000..362a0d3 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_records_from_carbon_table.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_1440.html + +.. _mrs_01_1440: + +DELETE RECORDS from CARBON TABLE +================================ + +Function +-------- + +This command is used to delete records from a CarbonData table. + +Syntax +------ + +**DELETE FROM CARBON_TABLE [WHERE expression];** + +Parameter Description +--------------------- + +.. 
table:: **Table 1** DELETE RECORDS parameters + + +--------------+-------------------------------------------------------------------------+ + | Parameter | Description | + +==============+=========================================================================+ + | CARBON TABLE | Name of the CarbonData table in which the DELETE operation is performed | + +--------------+-------------------------------------------------------------------------+ + +Precautions +----------- + +- If a segment is deleted, all secondary indexes associated with the segment are deleted as well. + +- If the **carbon.input.segments** property has been set for the queried table, the DELETE operation fails. To solve this problem, run the following statement before the query: + + Syntax: + + **SET carbon.input.segments. .=*;** + +Examples +-------- + +- Example 1: + + **delete from columncarbonTable1 d where d.column1 = 'country';** + +- Example 2: + + **delete from dest where column1 IN ('country1', 'country2');** + +- Example 3: + + **delete from columncarbonTable1 where column1 IN (select column11 from sourceTable2);** + +- Example 4: + + **delete from columncarbonTable1 where column1 IN (select column11 from sourceTable2 where column1 = 'USA');** + +- Example 5: + + **delete from columncarbonTable1 where column2 >= 4;** + +System Response +--------------- + +Success or failure will be recorded in the driver log and on the client. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_date.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_date.rst new file mode 100644 index 0000000..e9068d7 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_date.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1443.html + +.. _mrs_01_1443: + +DELETE SEGMENT by DATE +====================== + +Function +-------- + +This command is used to delete segments by loading date. Segments created before a specific date will be deleted. + +Syntax +------ + +**DELETE FROM TABLE db_name.table_name WHERE SEGMENT.STARTTIME BEFORE date_value**; + +Parameter Description +--------------------- + +.. table:: **Table 1** DELETE SEGMENT by DATE parameters + + +------------+----------------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+==============================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is used. | + +------------+----------------------------------------------------------------------------------------------+ + | table_name | Name of a table in the specified database | + +------------+----------------------------------------------------------------------------------------------+ + | date_value | Valid date when segments are started to be loaded. Segments before the date will be deleted. | + +------------+----------------------------------------------------------------------------------------------+ + +Precautions +----------- + +Segments cannot be deleted from the stream table. + +Example +------- + +**DELETE FROM TABLE db_name.table_name WHERE SEGMENT.STARTTIME BEFORE '2017-07-01 12:07:20'**; + +**STARTTIME** indicates the loading start time of different loads. 
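+ +If you are not sure which segments fall before the cutoff time, you can list them first. The following is a sketch only; **SHOW SEGMENTS** is described later in this reference and, among other details, reports when each load started. + +.. code-block:: + +   -- Inspect the segments and their load times before deleting by date. +   SHOW SEGMENTS FOR TABLE db_name.table_name; + +   -- Then remove everything loaded before the chosen point in time. +   DELETE FROM TABLE db_name.table_name WHERE SEGMENT.STARTTIME BEFORE '2017-07-01 12:07:20';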
+ +System Response +--------------- + +Success or failure will be recorded in CarbonData logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_id.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_id.rst new file mode 100644 index 0000000..be4b52a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/delete_segment_by_id.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1442.html + +.. _mrs_01_1442: + +DELETE SEGMENT by ID +==================== + +Function +-------- + +This command is used to delete segments by the ID. + +Syntax +------ + +**DELETE FROM TABLE db_name.table_name WHERE SEGMENT.ID IN (segment_id1,segment_id2)**; + +Parameter Description +--------------------- + +.. table:: **Table 1** DELETE SEGMENT parameters + + +------------+---------------------------------------------------------------------------------+ + | Parameter | Description | + +============+=================================================================================+ + | segment_id | ID of the segment to be deleted. | + +------------+---------------------------------------------------------------------------------+ + | db_name | Database name. If the parameter is not specified, the current database is used. | + +------------+---------------------------------------------------------------------------------+ + | table_name | The name of the table in a specific database. | + +------------+---------------------------------------------------------------------------------+ + +Usage Guidelines +---------------- + +Segments cannot be deleted from the stream table. + +Examples +-------- + +**DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0)**; + +**DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0,5,8)**; + +System Response +--------------- + +Success or failure will be recorded in the CarbonData log. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/drop_secondary_index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/drop_secondary_index.rst new file mode 100644 index 0000000..4902f5a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/drop_secondary_index.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_1447.html + +.. _mrs_01_1447: + +DROP SECONDARY INDEX +==================== + +Function +-------- + +This command is used to delete the existing secondary index table in a specific table. + +Syntax +------ + +**DROP INDEX** *[IF EXISTS] index_name*\ ** ON** *[db_name.]table_name*; + +Parameter Description +--------------------- + +.. table:: **Table 1** DROP SECONDARY INDEX parameters + + +------------+----------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+========================================================================================+ + | index_name | Name of the index table. Table name contains letters, digits, and underscores (_). | + +------------+----------------------------------------------------------------------------------------+ + | db_Name | Name of the database. If the parameter is not specified, the current database is used. 
| + +------------+----------------------------------------------------------------------------------------+ + | table_name | Name of the table to be deleted. | + +------------+----------------------------------------------------------------------------------------+ + +Usage Guidelines +---------------- + +In this command, **IF EXISTS** and **db_name** are optional. + +Examples +-------- + +**DROP INDEX** *if exists productNameIndexTable* **ON** *productdb.productSalesTable*; + +System Response +--------------- + +Secondary Index Table will be deleted. Index information will be cleared in CarbonData table and the success message will be recorded in system logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/index.rst new file mode 100644 index 0000000..48d3ba3 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/index.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_1437.html + +.. _mrs_01_1437: + +DML +=== + +- :ref:`LOAD DATA ` +- :ref:`UPDATE CARBON TABLE ` +- :ref:`DELETE RECORDS from CARBON TABLE ` +- :ref:`INSERT INTO CARBON TABLE ` +- :ref:`DELETE SEGMENT by ID ` +- :ref:`DELETE SEGMENT by DATE ` +- :ref:`SHOW SEGMENTS ` +- :ref:`CREATE SECONDARY INDEX ` +- :ref:`SHOW SECONDARY INDEXES ` +- :ref:`DROP SECONDARY INDEX ` +- :ref:`CLEAN FILES ` +- :ref:`SET/RESET ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + load_data + update_carbon_table + delete_records_from_carbon_table + insert_into_carbon_table + delete_segment_by_id + delete_segment_by_date + show_segments + create_secondary_index + show_secondary_indexes + drop_secondary_index + clean_files + set_reset diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/insert_into_carbon_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/insert_into_carbon_table.rst new file mode 100644 index 0000000..b5100eb --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/insert_into_carbon_table.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_1441.html + +.. _mrs_01_1441: + +INSERT INTO CARBON TABLE +======================== + +Function +-------- + +This command is used to add the output of the SELECT command to a Carbon table. + +Syntax +------ + +**INSERT INTO [CARBON TABLE] [select query]**; + +Parameter Description +--------------------- + +.. table:: **Table 1** INSERT INTO parameters + + +--------------+---------------------------------------------------------------------------------------+ + | Parameter | Description | + +==============+=======================================================================================+ + | CARBON TABLE | Name of the CarbonData table to be inserted | + +--------------+---------------------------------------------------------------------------------------+ + | select query | SELECT query on the source table (CarbonData, Hive, and Parquet tables are supported) | + +--------------+---------------------------------------------------------------------------------------+ + +Precautions +----------- + +- A table has been created. + +- You must belong to the data loading group in order to perform data loading operations. 
By default, the data loading group is named **ficommon**. + +- CarbonData tables cannot be overwritten. + +- The data type of the source table and the target table must be the same. Otherwise, data in the source table will be regarded as bad records. + +- The **INSERT INTO** command does not support partial success. If bad records exist, the command fails. + +- When you insert data of the source table to the target table, you cannot upload or update data of the source table. + + To enable data loading or updating during the INSERT operation, set the following parameter to **true**. + + **carbon.insert.persist.enable**\ =\ **true** + + By default, the preceding parameters are set to **false**. + + .. note:: + + Enabling this property will reduce the performance of the INSERT operation. + +Example +------- + +**create table** **carbon01(a int,b string,c string) stored as carbondata;** + +**insert into table** **carbon01 values(1,'a','aa'),(2,'b','bb'),(3,'c','cc');** + +**create table** **carbon02(a int,b string,c string) stored as carbondata;** + +**INSERT INTO** **carbon02 select \* from carbon01 where a > 1;** + +System Response +--------------- + +Success or failure will be recorded in the driver logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/load_data.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/load_data.rst new file mode 100644 index 0000000..67621b8 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/load_data.rst @@ -0,0 +1,221 @@ +:original_name: mrs_01_1438.html + +.. _mrs_01_1438: + +LOAD DATA +========= + +Function +-------- + +This command is used to load user data of a particular type, so that CarbonData can provide good query performance. + +.. note:: + + Only the raw data on HDFS can be loaded. + +Syntax +------ + +**LOAD DATA** *INPATH 'folder_path' INTO TABLE [db_name.]table_name OPTIONS(property_name=property_value, ...);* + +Parameter Description +--------------------- + +.. table:: **Table 1** LOAD DATA parameters + + +-------------+----------------------------------------------------------------------------------+ + | Parameter | Description | + +=============+==================================================================================+ + | folder_path | Path of the file or folder used for storing the raw CSV data. | + +-------------+----------------------------------------------------------------------------------+ + | db_name | Database name. If this parameter is not specified, the current database is used. | + +-------------+----------------------------------------------------------------------------------+ + | table_name | Name of a table in a database. | + +-------------+----------------------------------------------------------------------------------+ + +Precautions +----------- + +The following configuration items are involved during data loading: + +- **DELIMITER**: Delimiters and quote characters provided in the load command. The default value is a comma (**,**). + + *OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"')* + + You can use **'DELIMITER'='\\t'** to separate CSV data using tabs. + + OPTIONS('DELIMITER'='\\t') + + CarbonData also supports **\\001** and **\\017** as delimiters. + + .. note:: + + When the delimiter of CSV data is a single quotation mark ('), the single quotation mark must be enclosed in double quotation marks (" "). 
For example, 'DELIMITER'= "'". + +- **QUOTECHAR**: Delimiters and quote characters provided in the load command. The default value is double quotation marks (**"**). + + *OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"')* + +- **COMMENTCHAR**: Comment characters provided in the load command. During data loading, if there is a comment character at the beginning of a line, the line is regarded as a comment line and data in the line will not be loaded. The default value is a pound key (#). + + *OPTIONS('COMMENTCHAR'='#')* + +- **FILEHEADER**: If the source file does not contain any header, add a header to the **LOAD DATA** command. + + *OPTIONS('FILEHEADER'='column1,column2')* + +- **ESCAPECHAR**: Is used to perform strict verification of the escape character on CSV files. The default value is backslash (**\\**). + + OPTIONS('ESCAPECHAR'='\\') + + .. note:: + + Enter **ESCAPECHAR** in the CSV data. **ESCAPECHAR** must be enclosed in double quotation marks (" "). For example, "a\\b". + +- .. _mrs_01_1438__lcf623574402c443e908646591898c2be: + + Bad records handling: + + In order for the data processing application to provide benefits, certain data integration is required. In most cases, data quality problems are caused by data sources. + + Methods of handling bad records are as follows: + + - Load all of the data before dealing with the errors. + - Clean or delete bad records before loading data or stop the loading when bad records are found. + + There are many options for clearing source data during CarbonData data loading, as listed in :ref:`Table 2 `. + + .. _mrs_01_1438__t1d4d77614e2b4b92b0f334d52702013b: + + .. table:: **Table 2** Bad Records Logger + + +---------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Configuration Item | Default Value | Description | + +===========================+=======================+======================================================================================================================================================================================================================================================+ + | BAD_RECORDS_LOGGER_ENABLE | false | Whether to create logs with details about bad records | + +---------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | BAD_RECORDS_ACTION | FAIL | The four types of actions for bad records are as follows: | + | | | | + | | | - **FORCE**: Auto-corrects the data by storing the bad records as NULL. | + | | | - **REDIRECT**: Bad records are written to the raw CSV instead of being loaded. | + | | | - **IGNORE**: Bad records are neither loaded nor written to the raw CSV. | + | | | - **FAIL**: Data loading fails if any bad records are found. | + | | | | + | | | .. note:: | + | | | | + | | | In loaded data, if all records are bad records, **BAD_RECORDS_ACTION** is invalid and the load operation fails. 
| + +---------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | IS_EMPTY_DATA_BAD_RECORD | false | Whether empty data of a column to be considered as bad record or not. If this parameter is set to **false**, empty data ("",', or,) is not considered as bad records. If this parameter is set to **true**, empty data is considered as bad records. | + +---------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | BAD_RECORD_PATH | ``-`` | HDFS path where bad records are stored. The default value is **Null**. If bad records logging or bad records operation redirection is enabled, the path must be configured by the user. | + +---------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + Example: + + **LOAD DATA INPATH** *'filepath.csv'* **INTO TABLE** *tablename* *OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',* *'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon', 'BAD_RECORDS_ACTION'='REDIRECT', 'IS_EMPTY_DATA_BAD_RECORD'='false');* + + .. note:: + + If **REDIRECT** is used, CarbonData will add all bad records into a separate CSV file. However, this file must not be used for subsequent data loading because the content may not exactly match the source record. You must clean up the source record for further data ingestion. This option is used to remind you which records are bad. + +- **MAXCOLUMNS**: (Optional) Specifies the maximum number of columns parsed by a CSV parser in a line. + + *OPTIONS('MAXCOLUMNS'='400')* + + .. table:: **Table 3** MAXCOLUMNS + + ============================== ============= ============= + Name of the Optional Parameter Default Value Maximum Value + ============================== ============= ============= + MAXCOLUMNS 2000 20000 + ============================== ============= ============= + + .. 
table:: **Table 4** Behavior chart of MAXCOLUMNS + + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + | MAXCOLUMNS Value | Number of Columns in the File Header | Final Value Considered | + +===============================+======================================+=============================================================================+ + | Not specified in Load options | 5 | 2000 | + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + | Not specified in Load options | 6000 | 6000 | + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + | 40 | 7 | Max (column count of file header, MAXCOLUMNS value) | + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + | 22000 | 40 | 20000 | + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + | 60 | Not specified in Load options | Max (Number of columns in the first line of the CSV file, MAXCOLUMNS value) | + +-------------------------------+--------------------------------------+-----------------------------------------------------------------------------+ + + .. note:: + + There must be sufficient executor memory for setting the maximum value of **MAXCOLUMNS Option**. Otherwise, data loading will fail. + +- If **SORT_SCOPE** is set to **GLOBAL_SORT** during table creation, you can specify the number of partitions to be used when sorting data. If this parameter is not set or is set to a value less than **1**, the number of map tasks is used as the number of reduce tasks. It is recommended that each reduce task process 512 MB to 1 GB data. + + *OPTIONS('GLOBAL_SORT_PARTITIONS'='2')* + + .. note:: + + To increase the number of partitions, you may need to increase the value of **spark.driver.maxResultSize**, as the sampling data collected in the driver increases with the number of partitions. + +- **DATEFORMAT**: Specifies the date format of the table. + + *OPTIONS('DATEFORMAT'='dateFormat')* + + .. note:: + + Date formats are specified by date pattern strings. The date pattern letters in Carbon are same as in JAVA. + +- **TIMESTAMPFORMAT**: Specifies the timestamp of a table. +- *OPTIONS('TIMESTAMPFORMAT'='timestampFormat')* + +- **SKIP_EMPTY_LINE**: Ignores empty rows in the CSV file during data loading. + + *OPTIONS('SKIP_EMPTY_LINE'='TRUE/FALSE')* + +- **Optional:** **SCALE_FACTOR**: Used to control the number of partitions for **RANGE_COLUMN**, **SCALE_FACTOR**. The formula is as follows: + + .. code-block:: text + + splitSize = max(blocklet_size, (block_size - blocklet_size)) * scale_factor + numPartitions = total size of input data / splitSize + + The default value is **3**. The value ranges from **1** to **300**. + + *OPTIONS('SCALE_FACTOR'='10')* + + .. note:: + + - If **GLOBAL_SORT_PARTITIONS** and **SCALE_FACTOR** are used at the same time, only **GLOBAL_SORT_PARTITIONS** is valid. + - The compaction on **RANGE_COLUMN** will use **LOCAL_SORT** by default. 
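+ +Several of the options above can be combined in a single statement. The following sketch is illustrative only: the table name, file path, and header names are assumptions, and the table is assumed to have been created with a **RANGE_COLUMN** property so that **SCALE_FACTOR** applies. + +.. code-block:: + +   -- Load a headerless CSV, redirect bad records, and control range partitioning. +   LOAD DATA INPATH 'hdfs://hacluster/tmp/sales.csv' INTO TABLE sales_carbon +   OPTIONS('DELIMITER'=',', +   'QUOTECHAR'='"', +   'FILEHEADER'='id,country,name,salary', +   'BAD_RECORDS_LOGGER_ENABLE'='true', +   'BAD_RECORDS_ACTION'='REDIRECT', +   'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon', +   'SCALE_FACTOR'='10');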
+ +Scenarios +--------- + +To load a CSV file to a CarbonData table, run the following statement: + +**LOAD DATA** *INPATH 'folder path' INTO TABLE tablename OPTIONS(property_name=property_value, ...);* + +Examples +-------- + +The data in the **data.csv** file is as follows: + +.. code-block:: + + ID,date,country,name,phonetype,serialname,salary + 4,2014-01-21 00:00:00,city1,aaa4,phone2435,ASD66902,15003 + 5,2014-01-22 00:00:00,city1,aaa5,phone2441,ASD90633,15004 + 6,2014-03-07 00:00:00,city1,aaa6,phone294,ASD59961,15005 + +CREATE TABLE carbontable(ID int, date Timestamp, country String, name String, phonetype String, serialname String,salary int) STORED AS carbondata; + +**LOAD DATA** *inpath 'hdfs://hacluster/tmp/data.csv' INTO table carbontable* + +*options('DELIMITER'=',');* + +System Response +--------------- + +Success or failure will be recorded in the driver logs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/set_reset.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/set_reset.rst new file mode 100644 index 0000000..67ed044 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/set_reset.rst @@ -0,0 +1,143 @@ +:original_name: mrs_01_1449.html + +.. _mrs_01_1449: + +SET/RESET +========= + +Function +-------- + +This command is used to dynamically add, update, display, or reset the CarbonData properties without restarting the driver. + +Syntax +------ + +- Add or Update parameter value: + + **SET** *parameter_name*\ =\ *parameter_value* + + This command is used to add or update the value of **parameter_name**. + +- Display property value: + + **SET** *parameter_name* + + This command is used to display the value of **parameter_name**. + +- Display session parameter: + + **SET** + + This command is used to display all supported session parameters. + +- Display session parameters along with usage details: + + **SET** -v + + This command is used to display all supported session parameters and their usage details. + +- Reset parameter value: + + **RESET** + + This command is used to clear all session parameters. + +Parameter Description +--------------------- + +.. table:: **Table 1** SET parameters + + +-----------------+----------------------------------------------------------------------------------------+ + | Parameter | Description | + +=================+========================================================================================+ + | parameter_name | Name of the parameter whose value needs to be dynamically added, updated, or displayed | + +-----------------+----------------------------------------------------------------------------------------+ + | parameter_value | New value of **parameter_name** to be set | + +-----------------+----------------------------------------------------------------------------------------+ + +Precautions +----------- + +The following table lists the properties which you can set or clear using the SET or RESET command. + +.. 
table:: **Table 2** Properties + + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Property | Description | + +==========================================+======================================================================================================================================================================================+ + | carbon.options.bad.records.logger.enable | Whether to enable bad record logger. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.options.bad.records.action | Operations on bad records, for example, force, redirect, fail, or ignore. For more information, see :ref:`•Bad record handling `. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.options.is.empty.data.bad.record | Whether the empty data is considered as a bad record. For more information, see :ref:`Bad record handling `. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.options.sort.scope | Scope of the sort during data loading. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.options.bad.record.path | HDFS path where bad records are stored. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.custom.block.distribution | Whether to enable Spark or CarbonData block distribution. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.unsafe.sort | Whether to use unsafe sort during data loading. Unsafe sort reduces the garbage collection during data loading, thereby achieving better performance. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.si.lookup.partialstring | If this is set to **TRUE**, the secondary index uses the starts-with, ends-with, contains, and LIKE partition condition strings. | + | | | + | | If this is set to **FALSE**, the secondary index uses only the starts-with partition condition string. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.input.segments | Segment ID to be queried. 
This property allows you to query a specified segment of a specified table. CarbonScan reads data only from the specified segment ID. | + | | | + | | Syntax: | + | | | + | | **carbon.input.segments. . = < list of segment ids >** | + | | | + | | If you want to query a specified segment in multi-thread mode, you can use **CarbonSession.threadSet** instead of the **SET** statement. | + | | | + | | Syntax: | + | | | + | | **CarbonSession.threadSet ("carbon.input.segments. . ","< list of segment ids >");** | + | | | + | | .. note:: | + | | | + | | You are advised not to set this property in the **carbon.properties** file because all sessions contain the segment list unless session-level or thread-level overwriting occurs. | + +------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Examples +-------- + +- Add or Update: + + **SET** *enable.unsafe.sort*\ =\ *true* + +- Display property value: + + **SET** *enable.unsafe.sort* + +- Show the segment ID list, segment status, and other required details, and specify the segment list to be read: + + **SHOW SEGMENTS FOR** *TABLE carbontable1;* + + **SET** *carbon.input.segments.db.carbontable1 = 1, 3, 9;* + +- Query a specified segment in multi-thread mode: + + **CarbonSession.threadSet** (*"carbon.input.segments.default.carbon_table_MulTI_THread", "1,3"*); + +- Use **CarbonSession.threadSet** to query segments in a multi-thread environment (Scala code is used as an example): + + .. code-block:: + + def main(args: Array[String]) { + Future { CarbonSession.threadSet("carbon.input.segments.default.carbon_table_MulTI_THread", "1") + spark.sql("select count(empno) from carbon_table_MulTI_THread").show() + } + } + +- Reset: + + **RESET** + +System Response +--------------- + +- Success will be recorded in the driver log. +- Failure will be displayed on the UI. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_secondary_indexes.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_secondary_indexes.rst new file mode 100644 index 0000000..a6a792d --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_secondary_indexes.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1446.html + +.. _mrs_01_1446: + +SHOW SECONDARY INDEXES +====================== + +Function +-------- + +This command is used to list all secondary index tables in the CarbonData table. + +Syntax +------ + +**SHOW INDEXES ON db_name.table_name**; + +Parameter Description +--------------------- + +.. table:: **Table 1** SHOW SECONDARY INDEXES parameters + + +------------+-----------------------------------------------------------------------------------------+ + | Parameter | Description | + +============+=========================================================================================+ + | db_name | Database name. It consists of letters, digits, and special characters (_). | + +------------+-----------------------------------------------------------------------------------------+ + | table_name | Name of the database table. It consists of letters, digits, and special characters (_). 
| + +------------+-----------------------------------------------------------------------------------------+ + +Precautions +----------- + +**db_name** is optional. + +Examples +-------- + +**create table productdb.productSalesTable(id int,price int,productName string,city string) stored as carbondata;** + +**CREATE INDEX productNameIndexTable on table productdb.productSalesTable (productName,city) as 'carbondata' ;** + +**SHOW INDEXES ON productdb.productSalesTable**; + +System Response +--------------- + +All index tables and corresponding index columns in a given CarbonData table will be listed. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_segments.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_segments.rst new file mode 100644 index 0000000..f1886d4 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/show_segments.rst @@ -0,0 +1,61 @@ +:original_name: mrs_01_1444.html + +.. _mrs_01_1444: + +SHOW SEGMENTS +============= + +Function +-------- + +This command is used to list the segments of a CarbonData table. + +Syntax +------ + +**SHOW SEGMENTS FOR TABLE** *[db_name.]table_name* **LIMIT** *number_of_loads;* + +Parameter Description +--------------------- + +.. table:: **Table 1** SHOW SEGMENTS FOR TABLE parameters + + +-----------------+----------------------------------------------------------------------------------+ + | Parameter | Description | + +=================+==================================================================================+ + | db_name | Database name. If this parameter is not specified, the current database is used. | + +-----------------+----------------------------------------------------------------------------------+ + | table_name | Name of a table in the specified database | + +-----------------+----------------------------------------------------------------------------------+ + | number_of_loads | Threshold of records to be listed | + +-----------------+----------------------------------------------------------------------------------+ + +Precautions +----------- + +None + +Examples +-------- + +**create table** **carbon01(a int,b string,c string) stored as carbondata;** + +**insert into** **table** **carbon01 select 1,'a','aa';** + +**insert into table** **carbon01 select 2,'b','bb';** + +**insert into table** **carbon01 select 3,'c','cc';** + +**SHOW SEGMENTS FOR TABLE** **carbon01** **LIMIT** **2;** + +System Response +--------------- + +.. 
code-block:: + + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | ID | Status | Load Start Time | Load Time Taken | Partition | Data Size | Index Size | File Format | + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ + | 3 | Success | 2020-09-28 22:53:26.336 | 3.726S | {} | 6.47KB | 3.30KB | columnar_v3 | + | 2 | Success | 2020-09-28 22:53:01.702 | 6.688S | {} | 6.47KB | 3.30KB | columnar_v3 | + +-----+----------+--------------------------+------------------+------------+------------+-------------+--------------+--+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/update_carbon_table.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/update_carbon_table.rst new file mode 100644 index 0000000..77b813d --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/dml/update_carbon_table.rst @@ -0,0 +1,94 @@ +:original_name: mrs_01_1439.html + +.. _mrs_01_1439: + +UPDATE CARBON TABLE +=================== + +Function +-------- + +This command is used to update the CarbonData table based on the column expression and optional filtering conditions. + +Syntax +------ + +- Syntax 1: + + **UPDATE SET (column_name1, column_name2, ... column_name n) = (column1_expression , column2_expression , column3_expression ... column n_expression ) [ WHERE { } ];** + +- Syntax 2: + + **UPDATE SET (column_name1, column_name2,) = (select sourceColumn1, sourceColumn2 from sourceTable [ WHERE { } ] ) [ WHERE { } ];** + +Parameter Description +--------------------- + +.. table:: **Table 1** UPDATE parameters + + +--------------+-------------------------------------------------------------------------------+ + | Parameter | Description | + +==============+===============================================================================+ + | CARBON TABLE | Name of the CarbonData table to be updated | + +--------------+-------------------------------------------------------------------------------+ + | column_name | Target column to be updated | + +--------------+-------------------------------------------------------------------------------+ + | sourceColumn | Column value of the source table that needs to be updated in the target table | + +--------------+-------------------------------------------------------------------------------+ + | sourceTable | Table from which the records are updated to the target table | + +--------------+-------------------------------------------------------------------------------+ + +Precautions +----------- + +Note the following before running this command: + +- The UPDATE command fails if multiple input rows in the source table are matched with a single row in the target table. + +- If the source table generates empty records, the UPDATE operation completes without updating the table. + +- If rows in the source table do not match any existing rows in the target table, the UPDATE operation completes without updating the table. + +- UPDATE is not allowed in the table with secondary index. + +- In a subquery, if the source table and target table are the same, the UPDATE operation fails. + +- The UPDATE operation fails if the subquery used in the UPDATE command contains an aggregate function or a GROUP BY clause. 
+ + For example, **update t_carbn01 a set (a.item_type_code, a.profit) = ( select b.item_type_cd, sum(b.profit) from t_carbn01b b where item_type_cd =2 group by item_type_code);**. + + In the preceding example, the aggregate function **sum(b.profit)** and a GROUP BY clause are used in the subquery. As a result, the UPDATE operation fails. + +- If the **carbon.input.segments** property has been set for the queried table, the UPDATE operation fails. To solve this problem, run the following statement before the query: + + Syntax: + + **SET carbon.input.segments. . =\***; + +Examples +-------- + +- Example 1: + + **update carbonTable1 d set (d.column3,d.column5 ) = (select s.c33 ,s.c55 from sourceTable1 s where d.column1 = s.c11) where d.column1 = 'country' and exists( select \* from table3 o where o.c2 > 1);** + +- Example 2: + + **update carbonTable1 d set (c3) = (select s.c33 from sourceTable1 s where d.column1 = s.c11) where exists( select \* from iud.other o where o.c2 > 1);** + +- Example 3: + + **update carbonTable1 set (c2, c5 ) = (c2 + 1, concat(c5 , "y" ));** + +- Example 4: + + **update carbonTable1 d set (c2, c5 ) = (c2 + 1, "xyx") where d.column1 = 'india';** + +- Example 5: + + **update carbonTable1 d set (c2, c5 ) = (c2 + 1, "xyx") where d.column1 = 'india' and exists( select \* from table3 o where o.column2 > 1);** + +System Response +--------------- + +Success or failure will be recorded in the driver log and on the client. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/index.rst new file mode 100644 index 0000000..c588d29 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/index.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_1423.html + +.. _mrs_01_1423: + +CarbonData Syntax Reference +=========================== + +- :ref:`DDL ` +- :ref:`DML ` +- :ref:`Operation Concurrent Execution ` +- :ref:`API ` +- :ref:`Spatial Indexes ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + ddl/index + dml/index + operation_concurrent_execution + api + spatial_indexes diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/operation_concurrent_execution.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/operation_concurrent_execution.rst new file mode 100644 index 0000000..f990a4f --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/operation_concurrent_execution.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_24046.html + +.. _mrs_01_24046: + +Operation Concurrent Execution +============================== + +Before performing :ref:`DDL ` and :ref:`DML ` operations, you need to obtain the corresponding locks. See :ref:`Table 1 ` for details about the locks that need to be obtained for each operation. The check mark (Y) indicates that the lock is required. An operation can be performed only after all required locks are obtained. + +You can check whether any two operations can be executed concurrently by using the following method: locate the two rows in :ref:`Table 1 ` that correspond to the two operations. If there is no column in which both rows are marked with a check mark (Y), the two operations can be executed concurrently. 
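+
+Put differently, every operation owns a set of required locks, and two operations conflict exactly when those sets share at least one lock. The following sketch only illustrates that rule and is not a product API; the three lock sets are copied from the corresponding rows of Table 1 below:
+
+.. code-block::
+
+   // lock sets for three operations, taken from Table 1
+   val locks = Map(
+     "UPDATE CARBON TABLE"    -> Set("METADATA_LOCK", "COMPACTION_LOCK", "UPDATE_LOCK"),
+     "ALTER TABLE COMPACTION" -> Set("COMPACTION_LOCK", "UPDATE_LOCK"),
+     "SHOW SEGMENTS"          -> Set.empty[String]
+   )
+
+   // two operations can run concurrently only if their lock sets do not intersect
+   def canRunConcurrently(a: String, b: String): Boolean =
+     (locks(a) & locks(b)).isEmpty
+
+   canRunConcurrently("UPDATE CARBON TABLE", "ALTER TABLE COMPACTION") // false: both need COMPACTION_LOCK and UPDATE_LOCK
+   canRunConcurrently("UPDATE CARBON TABLE", "SHOW SEGMENTS")          // true: SHOW SEGMENTS acquires no locks
+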
That is, if the columns with check marks (Y) in the two lines do not exist, the two operations can be executed concurrently. + +.. _mrs_01_24046__table1548533815231: + +.. table:: **Table 1** List of obtaining locks for operations + + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | Operation | METADATA_LOCK | COMPACTION_LOCK | DROP_TABLE_LOCK | DELETE_SEGMENT_LOCK | CLEAN_FILES_LOCK | ALTER_PARTITION_LOCK | UPDATE_LOCK | STREAMING_LOCK | CONCURRENT_LOAD_LOCK | SEGMENT_LOCK | + +==================================+===============+=================+=================+=====================+==================+======================+=============+================+======================+==============+ + | CREATE TABLE | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | CREATE TABLE As SELECT | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DROP TABLE | Y | ``-`` | Y | ``-`` | ``-`` | ``-`` | ``-`` | Y | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | ALTER TABLE COMPACTION | ``-`` | Y | ``-`` | ``-`` | ``-`` | ``-`` | Y | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | TABLE RENAME | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | ADD COLUMNS | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DROP COLUMNS | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | CHANGE DATA TYPE | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | REFRESH TABLE | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + 
+----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | REGISTER INDEX TABLE | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | REFRESH INDEX | ``-`` | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | LOAD DATA/INSERT INTO | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | Y | Y | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | UPDATE CARBON TABLE | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | Y | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DELETE RECORDS from CARBON TABLE | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | Y | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DELETE SEGMENT by ID | ``-`` | ``-`` | ``-`` | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DELETE SEGMENT by DATE | ``-`` | ``-`` | ``-`` | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | SHOW SEGMENTS | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | CREATE SECONDARY INDEX | Y | Y | ``-`` | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | SHOW SECONDARY INDEXES | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | DROP SECONDARY INDEX | Y | ``-`` | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | 
``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | CLEAN FILES | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | SET/RESET | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | Add Hive Partition | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | Drop Hive Partition | Y | Y | Y | Y | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | Drop Partition | Y | Y | Y | Y | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ + | Alter table set | Y | Y | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | ``-`` | + +----------------------------------+---------------+-----------------+-----------------+---------------------+------------------+----------------------+-------------+----------------+----------------------+--------------+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/spatial_indexes.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/spatial_indexes.rst new file mode 100644 index 0000000..6c6e291 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_syntax_reference/spatial_indexes.rst @@ -0,0 +1,656 @@ +:original_name: mrs_01_1451.html + +.. _mrs_01_1451: + +Spatial Indexes +=============== + +Quick Example +------------- + +.. 
code-block:: + + create table IF NOT EXISTS carbonTable + ( + COLUMN1 BIGINT, + LONGITUDE BIGINT, + LATITUDE BIGINT, + COLUMN2 BIGINT, + COLUMN3 BIGINT + ) + STORED AS carbondata + TBLPROPERTIES ('SPATIAL_INDEX.mygeohash.type'='geohash','SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude','SPATIAL_INDEX.mygeohash.originLatitude'='39.850713','SPATIAL_INDEX.mygeohash.gridSize'='50','SPATIAL_INDEX.mygeohash.minLongitude'='115.828503','SPATIAL_INDEX.mygeohash.maxLongitude'='720.000000','SPATIAL_INDEX.mygeohash.minLatitude'='39.850713','SPATIAL_INDEX.mygeohash.maxLatitude'='720.000000','SPATIAL_INDEX'='mygeohash','SPATIAL_INDEX.mygeohash.conversionRatio'='1000000','SORT_COLUMNS'='column1,column2,column3,latitude,longitude'); + +Introduction to Spatial Indexes +------------------------------- + +Spatial data includes multidimensional points, lines, rectangles, cubes, polygons, and other geometric objects. A spatial data object occupies a certain region of space, called spatial scope, characterized by its location and boundary. The spatial data can be either point data or region data. + +- Point data: A point has a spatial extent characterized completely by its location. It does not occupy space and has no associated boundary. Point data consists of a collection of points in a two-dimensional space. Points can be stored as a pair of longitude and latitude. +- Region data: A region has a spatial extent with a location, and boundary. The location can be considered as the position of a fixed point in the region, such as its centroid. In two dimensions, the boundary can be visualized as a line (for finite regions, a closed loop). Region data contains a collection of regions. + +Currently, only point data is supported, and it can be stored. + +Longitude and latitude can be encoded as a unique GeoID. Geohash is a public-domain geocoding system invented by Gustavo Niemeyer. It encodes geographical locations into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides the space into buckets of grid shape, which is one of the many applications of what is known as the Z-order curve, and generally the space-filling curve. + +The Z value of a point in multiple dimensions is calculated by interleaving the binary representation of its coordinate value, as shown in the following figure. When Geohash is used to create a GeoID, data is sorted by GeoID instead of longitude and latitude. Data is stored by spatial proximity. + +|image1| + +Creating a Table +---------------- + +**GeoHash encoding**: + +.. code-block:: + + create table IF NOT EXISTS carbonTable + ( + ... + `LONGITUDE` BIGINT, + `LATITUDE` BIGINT, + ... + ) + STORED AS carbondata + TBLPROPERTIES ('SPATIAL_INDEX.mygeohash.type'='geohash','SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude','SPATIAL_INDEX.mygeohash.originLatitude'='xx.xxxxxx','SPATIAL_INDEX.mygeohash.gridSize'='xx','SPATIAL_INDEX.mygeohash.minLongitude'='xxx.xxxxxx','SPATIAL_INDEX.mygeohash.maxLongitude'='xxx.xxxxxx','SPATIAL_INDEX.mygeohash.minLatitude'='xx.xxxxxx','SPATIAL_INDEX.mygeohash.maxLatitude'='xxx.xxxxxx','SPATIAL_INDEX'='mygeohash','SPATIAL_INDEX.mygeohash.conversionRatio'='1000000','SORT_COLUMNS'='column1,column2,column3,latitude,longitude'); + +**SPATIAL_INDEX** is a user-defined index handler. This handler allows users to create new columns from the table-structure column set. The new column name is the same as that of the handler name. The **type** and **sourcecolumns** properties of the handler are mandatory. 
Currently, the value of **type** supports only **geohash**. Carbon provides a default implementation class that can be easily used. You can extend the default implementation class to mount the customized implementation class of **geohash**. The default handler also needs to provide the following table properties: + +- **SPATIAL_INDEX.**\ *xxx*\ **.originLatitude**: specifies the origin latitude. (**Double** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.gridSize**: specifies the grid length in meters. (**Int** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.minLongitude**: specifies the minimum longitude. (**Double** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.maxLongitude**: specifies the maximum longitude. (**Double** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.minLatitude**: specifies the minimum latitude. (**Double** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.maxLatitude**: specifies the maximum latitude. (**Double** type.) +- **SPATIAL_INDEX.**\ *xxx*\ **.conversionRatio**: used to convert the small value of the longitude and latitude to an integer. (**Int** type.) + +You can add your own table properties to the handlers in the above format and access them in your custom implementation class. **originLatitude**, **gridSize**, and **conversionRatio** are mandatory. Other parameters are optional in Carbon. You can use the **SPATIAL_INDEX.**\ *xxx*\ **.class** property to specify their implementation classes. + +The default implementation class can generate handler column values for **sourcecolumns** in each row and support query based on the **sourcecolumns** filter criteria. The generated handler column is invisible to users. Except the **SORT_COLUMNS** table properties, no DDL commands or properties are allowed to contain the handler column. + +.. note:: + + - By default, the generated handler column is regarded as the sorting column. If **SORT_COLUMNS** does not contain any **sourcecolumns**, add the handler column to the end of the existing **SORT_COLUMNS**. If the handler column has been specified in **SORT_COLUMNS**, its order in **SORT_COLUMNS** remains unchanged. + - If **SORT_COLUMNS** contains any **sourcecolumns** but does not contain the handler column, the handler column is automatically inserted before **sourcecolumns** in **SORT_COLUMNS**. + - If **SORT_COLUMNS** needs to contain any **sourcecolumns**, ensure that the handler column is listed before the **sourcecolumns** so that the handler column can take effect during sorting. + +**GeoSOT encoding**: + +.. code-block:: + + CREATE TABLE carbontable( + ... + longitude DOUBLE, + latitude DOUBLE, + ...) + STORED AS carbondata + TBLPROPERTIES ('SPATIAL_INDEX'='xxx', + 'SPATIAL_INDEX.xxx.type'='geosot', + 'SPATIAL_INDEX.xxx.sourcecolumns'='longitude, latitude', + 'SPATIAL_INDEX.xxx.level'='21', + 'SPATIAL_INDEX.xxx.class'='org.apache.carbondata.geo.GeoSOTIndex') + +.. table:: **Table 1** Parameter description + + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=================================+=========================================================================================================================================================================================+ + | SPATIAL_INDEX | Specifies the spatial index. Its value is the same as the column name. 
| + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SPATIAL_INDEX.xxx.type | (Mandatory) The value is set to **geosot**. | + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SPATIAL_INDEX.xxx.sourcecolumns | (Mandatory) Specifies the source columns for calculating the spatial index. The value must be two existing columns of the double type. | + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SPATIAL_INDEX.xxx.level | (Optional) Specifies the columns for calculating the spatial index. The default value is **17**, through which you can obtain an accurate result and improve the computing performance. | + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SPATIAL_INDEX.xxx.class | (Optional) Specifies the implementation class of GeoSOT. The default value is **org.apache.carbondata.geo.GeoSOTIndex**. | + +---------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Example: + +.. code-block:: + + create table geosot( + timevalue bigint, + longitude double, + latitude double) + stored as carbondata + TBLPROPERTIES ('SPATIAL_INDEX'='mygeosot', + 'SPATIAL_INDEX.mygeosot.type'='geosot', + 'SPATIAL_INDEX.mygeosot.level'='21', 'SPATIAL_INDEX.mygeosot.sourcecolumns'='longitude, latitude'); + +.. _mrs_01_1451__section106234720257: + +Preparing Data +-------------- + +- Data file 1: **geosotdata.csv** + + .. code-block:: + + timevalue,longitude,latitude + 1575428400000,116.285807,40.084087 + 1575428400000,116.372142,40.129503 + 1575428400000,116.187332,39.979316 + 1575428400000,116.337069,39.951887 + 1575428400000,116.359102,40.154684 + 1575428400000,116.736367,39.970323 + 1575428400000,116.720179,40.009893 + 1575428400000,116.346961,40.13355 + 1575428400000,116.302895,39.930753 + 1575428400000,116.288955,39.999101 + 1575428400000,116.17609,40.129953 + 1575428400000,116.725575,39.981115 + 1575428400000,116.266922,40.179415 + 1575428400000,116.353706,40.156483 + 1575428400000,116.362699,39.942444 + 1575428400000,116.325378,39.963129 + +- Data file 2: **geosotdata2.csv** + + .. 
code-block:: + + timevalue,longitude,latitude + 1575428400000,120.17708,30.326882 + 1575428400000,120.180685,30.326327 + 1575428400000,120.184976,30.327105 + 1575428400000,120.189311,30.327549 + 1575428400000,120.19446,30.329698 + 1575428400000,120.186965,30.329133 + 1575428400000,120.177481,30.328911 + 1575428400000,120.169713,30.325614 + 1575428400000,120.164563,30.322243 + 1575428400000,120.171558,30.319613 + 1575428400000,120.176365,30.320687 + 1575428400000,120.179669,30.323688 + 1575428400000,120.181001,30.320761 + 1575428400000,120.187094,30.32354 + 1575428400000,120.193574,30.323651 + 1575428400000,120.186192,30.320132 + 1575428400000,120.190055,30.317464 + 1575428400000,120.195376,30.318094 + 1575428400000,120.160786,30.317094 + 1575428400000,120.168211,30.318057 + 1575428400000,120.173618,30.316612 + 1575428400000,120.181001,30.317316 + 1575428400000,120.185162,30.315908 + 1575428400000,120.192415,30.315871 + 1575428400000,120.161902,30.325614 + 1575428400000,120.164306,30.328096 + 1575428400000,120.197093,30.325985 + 1575428400000,120.19602,30.321651 + 1575428400000,120.198638,30.32354 + 1575428400000,120.165421,30.314834 + +Importing Data +-------------- + +The GeoHash default implementation class extends the customized index abstract class. If the handler property is not set to a customized implementation class, the default implementation class is used. You can extend the default implementation class to mount the customized implementation class of **geohash**. The methods of the customized index abstract class are as follows: + +- **Init** method: Used to extract, verify, and store the handler property. If the operation fails, the system throws an exception and displays the error information. +- **Generate** method: Used to generate indexes. It generates an index for each row of data. +- **Query** method: Used to generate an index value range list for given input. + +The commands for importing data are the same as those for importing common Carbon tables. + +**LOAD DATA inpath '/tmp/**\ *geosotdata.csv*\ **' INTO TABLE geosot OPTIONS ('DELIMITER'= ',');** + +**LOAD DATA inpath '/tmp/**\ *geosotdata2.csv*\ **' INTO TABLE geosot OPTIONS ('DELIMITER'= ',');** + +.. note:: + + For details about **geosotdata.csv** and **geosotdata2.csv**, see :ref:`Preparing Data `. + +Aggregate Query of Irregular Spatial Sets +----------------------------------------- + +**Query statements and filter UDFs** + +- Filtering data based on polygon + + **IN_POLYGON(pointList)** + + UDF input parameter + + +-----------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+========+================================================================================================================================================================================================================================================================================================+ + | pointList | String | Enter multiple points as a string. Each point is presented as **longitude latitude**. Longitude and latitude are separated by a space. Each pair of longitude and latitude is separated by a comma (,). The longitude and latitude values at the start and end of the string must be the same. 
| + +-----------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + UDF output parameter + + +-----------+---------+-----------------------------------------------------------+ + | Parameter | Type | Description | + +===========+=========+===========================================================+ + | inOrNot | Boolean | Checks whether data is in the specified **polygon_list**. | + +-----------+---------+-----------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude from geosot where IN_POLYGON('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 116.321011 40.123503'); + +- Filtering data based on the polygon list + + **IN_POLYGON_LIST(polygonList, opType)** + + UDF input parameters + + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+=======================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | polygonList | String | Inputs multiple polygons as a string. Each polygon is presented as **POLYGON ((longitude1 latitude1, longitude2 latitude2, …))**. Note that there is a space after **POLYGON**. Longitudes and latitudes are separated by spaces. Each pair of longitude and latitude is separated by a comma (,). The longitudes and latitudes at the start and end of a polygon must be the same. **IN_POLYGON_LIST** requires at least two polygons. | + | | | | + | | | Example: | + | | | | + | | | .. code-block:: | + | | | | + | | | POLYGON ((116.137676 40.163503, 116.137676 39.935276, 116.560993 39.935276, 116.137676 40.163503)) | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | opType | String | Performs union, intersection, and subtraction on multiple polygons. | + | | | | + | | | Currently, the following operation types are supported: | + | | | | + | | | - OR: A U B U C (Assume that three polygons A, B, and C are input.) 
| + | | | - AND: A ∩ B ∩ C | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + UDF output parameter + + +-----------+---------+-----------------------------------------------------------+ + | Parameter | Type | Description | + +===========+=========+===========================================================+ + | inOrNot | Boolean | Checks whether data is in the specified **polygon_list**. | + +-----------+---------+-----------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude from geosot where IN_POLYGON_LIST('POLYGON ((120.176433 30.327431,120.171283 30.322245,120.181411 30.314540, 120.190509 30.321653,120.185188 30.329358,120.176433 30.327431)), POLYGON ((120.191603 30.328946,120.184179 30.327465,120.181819 30.321464, 120.190359 30.315388,120.199242 30.324464,120.191603 30.328946))', 'OR'); + +- Filtering data based on the polyline list + + **IN_POLYLINE_LIST(polylineList, bufferInMeter)** + + UDF input parameters + + +-----------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+=======================+==========================================================================================================================================================================================================================================================================================================+ + | polylineList | String | Inputs multiple polylines as a string. Each polyline is presented as **LINESTRING (longitude1 latitude1, longitude2 latitude2, …)**. Note that there is a space after **LINESTRING**. Longitudes and latitudes are separated by spaces. Each pair of longitude and latitude is separated by a comma (,). | + | | | | + | | | A union will be output based on the data in multiple polylines. | + | | | | + | | | Example: | + | | | | + | | | .. code-block:: | + | | | | + | | | LINESTRING (116.137676 40.163503, 116.137676 39.935276, 116.260993 39.935276) | + +-----------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bufferInMeter | Float | Polyline buffer distance, in meters. Right angles are used at the end to create a buffer. 
| + +-----------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + UDF output parameter + + +-----------+---------+------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+=========+============================================================+ + | inOrNot | Boolean | Checks whether data is in the specified **polyline_list**. | + +-----------+---------+------------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude from geosot where IN_POLYLINE_LIST('LINESTRING (120.184179 30.327465, 120.191603 30.328946, 120.199242 30.324464, 120.190359 30.315388)', 65); + +- Filtering data based on the GeoID range list + + **IN_POLYGON_RANGE_LIST(polygonRangeList, opType)** + + UDF input parameters + + +-----------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+=======================+======================================================================================================================================================================================================================================================================================================+ + | polygonRangeList | String | Inputs multiple rangeLists as a string. Each rangeList is presented as **RANGELIST (startGeoId1 endGeoId1, startGeoId2 endGeoId2, …)**. Note that there is a space after **RANGELIST**. Start GeoIDs and end GeoIDs are separated by spaces. Each group of GeoID ranges is separated by a comma (,). | + | | | | + | | | Example: | + | | | | + | | | .. code-block:: | + | | | | + | | | RANGELIST (855279368848 855279368850, 855280799610 855280799612, 855282156300 855282157400) | + +-----------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | opType | String | Performs union, intersection, and subtraction on multiple rangeLists. | + | | | | + | | | Currently, the following operation types are supported: | + | | | | + | | | - OR: A U B U C (Assume that three rangeLists A, B, and C are input.) 
| + | | | - AND: A ∩ B ∩ C | + +-----------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + UDF output parameter + + +-----------+---------+-------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+=========+=============================================================+ + | inOrNot | Boolean | Checks whether data is in the specified **polyRange_list**. | + +-----------+---------+-------------------------------------------------------------+ + + Example: + + .. code-block:: + + select mygeosot, longitude, latitude from geosot where IN_POLYGON_RANGE_LIST('RANGELIST (526549722865860608 526549722865860618, 532555655580483584 532555655580483594)', 'OR'); + +- Performing polygon query + + **IN_POLYGON_JOIN(GEO_HASH_INDEX_COLUMN, POLYGON_COLUMN)** + + Perform join query on two tables. One is a spatial data table containing the longitude, latitude, and GeoHashIndex columns, and the other is a dimension table that saves polygon data. + + During query, **IN_POLYGON_JOIN UDF**, **GEO_HASH_INDEX_COLUMN**, and **POLYGON_COLUMN** of the polygon table are used. **Polygon_column** specifies the column containing multiple points (longitude and latitude pairs). The first and last points in each row of the Polygon table must be the same. All points in each row form a closed geometric shape. + + UDF input parameters + + +-----------------------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+========+=================================================================================================================================================================================+ + | GEO_HASH_INDEX_COLUMN | Long | GeoHashIndex column of the spatial data table. | + +-----------------------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | POLYGON_COLUMN | String | Polygon column of the polygon table, the value of which is represented by the string of polygon, for example, **POLYGON (( longitude1 latitude1, longitude2 latitude2, ...))**. | + +-----------------------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + Example: + + .. 
code-block:: + + CREATE TABLE polygonTable( + polygon string, + poiType string, + poiId String) + STORED AS carbondata; + + insert into polygonTable select 'POLYGON ((120.176433 30.327431,120.171283 30.322245, 120.181411 30.314540,120.190509 30.321653,120.185188 30.329358,120.176433 30.327431))','abc','1'; + + insert into polygonTable select 'POLYGON ((120.191603 30.328946,120.184179 30.327465, 120.181819 30.321464,120.190359 30.315388,120.199242 30.324464,120.191603 30.328946))','abc','2'; + + select t1.longitude,t1.latitude from geosot t1 + inner join + (select polygon,poiId from polygonTable where poitype='abc') t2 + on in_polygon_join(t1.mygeosot,t2.polygon) group by t1.longitude,t1.latitude; + +- Performing range_list query + + **IN_POLYGON_JOIN_RANGE_LIST(GEO_HASH_INDEX_COLUMN, POLYGON_COLUMN)** + + Use the **IN_POLYGON_JOIN_RANGE_LIST** UDF to associate the spatial data table with the polygon dimension table based on **Polygon_RangeList**. By using a range list, you can skip the conversion between a polygon and a range list. + + UDF input parameters + + +-----------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+========+======================================================================================================================================================================================+ + | GEO_HASH_INDEX_COLUMN | Long | GeoHashIndex column of the spatial data table. | + +-----------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | POLYGON_COLUMN | String | Rangelist column of the Polygon table, the value of which is represented by the string of rangeList, for example, **RANGELIST (startGeoId1 endGeoId1, startGeoId2 endGeoId2, ...)**. | + +-----------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + Example: + + .. code-block:: + + CREATE TABLE polygonTable( + polygon string, + poiType string, + poiId String) + STORED AS carbondata; + + insert into polygonTable select 'RANGELIST (526546455897309184 526546455897309284, 526549831217315840 526549831217315850, 532555655580483534 532555655580483584)','xyz','2'; + + select t1.* + from geosot t1 + inner join + (select polygon,poiId from polygonTable where poitype='xyz') t2 + on in_polygon_join_range_list(t1.mygeosot,t2.polygon); + +**UDFs of spacial index tools** + +- Obtaining row number and column number of a grid converted from GeoID + + **GeoIdToGridXy(geoId)** + + UDF input parameter + + +-----------+------+-------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+======+=========================================================================+ + | geoId | Long | Calculates the row number and column number of the grid based on GeoID. 
| + +-----------+------+-------------------------------------------------------------------------+ + + UDF output parameter + + +-----------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+============+==================================================================================================================================================================+ + | gridArray | Array[Int] | Returns the grid row and column numbers contained in GeoID in array. The first digit indicates the row number, and the second digit indicates the column number. | + +-----------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude, mygeohash, GeoIdToGridXy(mygeohash) as GridXY from geoTable; + +- Converting longitude and latitude to GeoID + + **LatLngToGeoId(latitude, longitude oriLatitude, gridSize)** + + UDF input parameters + + +-------------+--------+------------------------------------------------------------+ + | Parameter | Type | Description | + +=============+========+============================================================+ + | longitude | Long | Longitude. Note: The value is an integer after conversion. | + +-------------+--------+------------------------------------------------------------+ + | latitude | Long | Latitude. Note: The value is an integer after conversion. | + +-------------+--------+------------------------------------------------------------+ + | oriLatitude | Double | Origin latitude, required for calculating GeoID. | + +-------------+--------+------------------------------------------------------------+ + | gridSize | Int | Grid size, required for calculating GeoID. | + +-------------+--------+------------------------------------------------------------+ + + UDF output parameter + + +-----------+------+--------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+======+==========================================================================+ + | geoId | Long | Returns a number that indicates the longitude and latitude after coding. | + +-----------+------+--------------------------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude, mygeohash, LatLngToGeoId(latitude, longitude, 39.832277, 50) as geoId from geoTable; + +- Converting GeoID to longitude and latitude + + **GeoIdToLatLng(geoId, oriLatitude, gridSize)** + + UDF input parameters + + +-------------+--------+-----------------------------------------------------------------------+ + | Parameter | Type | Description | + +=============+========+=======================================================================+ + | geoId | Long | Calculates the longitude and latitude based on GeoID. | + +-------------+--------+-----------------------------------------------------------------------+ + | oriLatitude | Double | Origin latitude, required for calculating the longitude and latitude. | + +-------------+--------+-----------------------------------------------------------------------+ + | gridSize | Int | Grid size, required for calculating the longitude and latitude. 
| + +-------------+--------+-----------------------------------------------------------------------+ + + .. note:: + + GeoID is generated based on the grid coordinates, which are the grid center. Therefore, the calculated longitude and latitude are the longitude and latitude of the grid center. There may be an error ranging from 0 degree to half of the grid size between the calculated longitude and latitude and the longitude and latitude of the generated GeoID. + + UDF output parameter + + +----------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +======================+===============+============================================================================================================================================================================================+ + | latitudeAndLongitude | Array[Double] | Returns the longitude and latitude coordinates of the grid center that represent the GeoID in array. The first digit indicates the latitude, and the second digit indicates the longitude. | + +----------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + Example: + + .. code-block:: + + select longitude, latitude, mygeohash, GeoIdToLatLng(mygeohash, 39.832277, 50) as LatitudeAndLongitude from geoTable; + +- Calculating the upper-layer GeoID of the pyramid model + + **ToUpperLayerGeoId(geoId)** + + UDF input parameter + + +-----------+------+---------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+======+=================================================================================+ + | geoId | Long | Calculates the upper-layer GeoID of the pyramid model based on the input GeoID. | + +-----------+------+---------------------------------------------------------------------------------+ + + UDF output parameter + + ========= ==== =================================================== + Parameter Type Description + ========= ==== =================================================== + geoId Long Returns the upper-layer GeoID of the pyramid model. + ========= ==== =================================================== + + Example: + + .. code-block:: + + select longitude, latitude, mygeohash, ToUpperLayerGeoId(mygeohash) as upperLayerGeoId from geoTable; + +- Obtaining the GeoID range list using the input polygon + + **ToRangeList(polygon, oriLatitude, gridSize)** + + UDF input parameters + + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Type | Description | + +=======================+=======================+=====================================================================================================================================================================================+ + | polygon | String | Input polygon string, which is a pair of longitude and latitude. | + | | | | + | | | Longitude and latitude are separated by a space. Each pair of longitude and latitude is separated by a comma (,). 
The longitude and latitude at the start and end must be the same. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | oriLatitude | Double | Origin latitude, required for calculating GeoID. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | gridSize | Int | Grid size, required for calculating GeoID. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + UDF output parameter + + ========= =================== ========================================= + Parameter Type Description + ========= =================== ========================================= + geoIdList Buffer[Array[Long]] Converts polygons into GeoID range lists. + ========= =================== ========================================= + + Example: + + .. code-block:: + + select ToRangeList('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 116.321011 40.123503', 39.832277, 50) as rangeList from geoTable; + +- Calculating the upper-layer longitude of the pyramid model + + **ToUpperLongitude (longitude, gridSize, oriLat)** + + UDF input parameters + + =========== ====== ==================================================== + Parameter Type Description + =========== ====== ==================================================== + longitude Long Input longitude, which is a long integer. + gridSize Int Grid size, required for calculating longitude. + oriLatitude Double Origin latitude, required for calculating longitude. + =========== ====== ==================================================== + + UDF output parameter + + ========= ==== ================================== + Parameter Type Description + ========= ==== ================================== + longitude Long Returns the upper-layer longitude. + ========= ==== ================================== + + Example: + + .. code-block:: + + select ToUpperLongitude (-23575161504L, 50, 39.832277) as upperLongitude from geoTable; + +- Calculating the upper-layer latitude of the pyramid model + + **ToUpperLatitude(Latitude, gridSize, oriLat)** + + UDF input parameters + + =========== ====== =================================================== + Parameter Type Description + =========== ====== =================================================== + latitude Long Input latitude, which is a long integer. + gridSize Int Grid size, required for calculating latitude. + oriLatitude Double Origin latitude, required for calculating latitude. + =========== ====== =================================================== + + UDF output parameter + + ========= ==== ================================= + Parameter Type Description + ========= ==== ================================= + Latitude Long Returns the upper-layer latitude. + ========= ==== ================================= + + Example: + + .. 
code-block:: + + select ToUpperLatitude (-23575161504L, 50, 39.832277) as upperLatitude from geoTable; + +- Converting longitude and latitude to GeoSOT + + **LatLngToGridCode(latitude, longitude, level)** + + UDF input parameters + + ========= ====== ================================== + Parameter Type Description + ========= ====== ================================== + latitude Double Latitude. + longitude Double Longitude. + level Int Level. The value range is [0, 32]. + ========= ====== ================================== + + UDF output parameter + + +-----------+------+---------------------------------------------------------------------------+ + | Parameter | Type | Description | + +===========+======+===========================================================================+ + | geoId | Long | A number that indicates the longitude and latitude after GeoSOT encoding. | + +-----------+------+---------------------------------------------------------------------------+ + + Example: + + .. code-block:: + + select LatLngToGridCode(39.930753, 116.302895, 21) as geoId; + +.. |image1| image:: /_static/images/en-us_image_0000001296090372.png diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/filter_result_is_not_consistent_with_hive_when_a_big_double_type_value_is_used_in_filter.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/filter_result_is_not_consistent_with_hive_when_a_big_double_type_value_is_used_in_filter.rst new file mode 100644 index 0000000..861ed33 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/filter_result_is_not_consistent_with_hive_when_a_big_double_type_value_is_used_in_filter.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_1455.html + +.. _mrs_01_1455: + +Filter Result Is not Consistent with Hive when a Big Double Type Value Is Used in Filter +======================================================================================== + +Symptom +------- + +When double data type values with higher precision are used in filters, incorrect values are returned by filtering results. + +Possible Causes +--------------- + +When double data type values with higher precision are used in filters, values are rounded off before comparison. Therefore, values of double data type with different fraction part are considered same. + +Troubleshooting Method +---------------------- + +NA. + +Procedure +--------- + +To avoid this problem, use decimal data type when high precision data comparisons are required, such as financial applications, equality and inequality checks, and rounding operations. + +Reference Information +--------------------- + +NA. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/index.rst new file mode 100644 index 0000000..5499587 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_1454.html + +.. _mrs_01_1454: + +CarbonData Troubleshooting +========================== + +- :ref:`Filter Result Is not Consistent with Hive when a Big Double Type Value Is Used in Filter ` +- :ref:`Query Performance Deterioration ` + +.. 
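+For the first issue listed above, the workaround is to declare columns that need exact comparisons as **decimal** rather than **double**. The following is only an illustrative sketch: the table name, column names, and precision are examples, and the **STORED AS carbondata** clause assumes the standard CarbonData table syntax.
+
+.. code-block::
+
+   CREATE TABLE IF NOT EXISTS finance_txn (
+     txn_id INT,
+     amount DECIMAL(20,6)
+   ) STORED AS carbondata;
+
+   SELECT txn_id, amount FROM finance_txn WHERE amount = 1234.567890;
+
+Because the decimal value is stored and compared exactly, the filter avoids the rounding behavior described above.
+
+..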
toctree:: + :maxdepth: 1 + :hidden: + + filter_result_is_not_consistent_with_hive_when_a_big_double_type_value_is_used_in_filter + query_performance_deterioration diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/query_performance_deterioration.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/query_performance_deterioration.rst new file mode 100644 index 0000000..4d8df8a --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/carbondata_troubleshooting/query_performance_deterioration.rst @@ -0,0 +1,35 @@ +:original_name: mrs_01_1456.html + +.. _mrs_01_1456: + +Query Performance Deterioration +=============================== + +Symptom +------- + +The query performance fluctuates when the query is executed in different query periods. + +Possible Causes +--------------- + +During data loading, the memory configured for each executor program instance may be insufficient, resulting in more Java GCs. When GC occurs, the query performance deteriorates. + +Troubleshooting Method +---------------------- + +On the Spark UI, the GC time of some executors is obviously higher than that of other executors, or all executors have high GC time. + +Procedure +--------- + +Log in to Manager and choose **Cluster** > **Services** > **Spark2x**. On the displayed page, click the **Configurations** tab and then **All Configurations**, search for **spark.executor.memory** in the search box, and set its value to a larger value. + +|image1| + +Reference +--------- + +None + +.. |image1| image:: /_static/images/en-us_image_0000001349170105.png diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/configuration_reference.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/configuration_reference.rst new file mode 100644 index 0000000..6357e9d --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/configuration_reference.rst @@ -0,0 +1,330 @@ +:original_name: mrs_01_1404.html + +.. _mrs_01_1404: + +Configuration Reference +======================= + +This section provides the details of all the configurations required for the CarbonData System. + +.. table:: **Table 1** System configurations in **carbon.properties** + + +----------------------------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +============================+===========================+==================================================================================================================================================================================================================================================================================================+ + | carbon.ddl.base.hdfs.url | hdfs://hacluster/opt/data | HDFS relative path from the HDFS base path, which is configured in **fs.defaultFS**. The path configured in **carbon.ddl.base.hdfs.url** will be appended to the HDFS path configured in **fs.defaultFS**. If this path is configured, you do not need to pass the complete path while dataload. 
| + | | | | + | | | For example, if the absolute path of the CSV file is **hdfs://10.18.101.155:54310/data/cnbc/2016/xyz.csv**, the path **hdfs://10.18.101.155:54310** will come from property **fs.defaultFS** and you can configure **/data/cnbc/** as **carbon.ddl.base.hdfs.url**. | + | | | | + | | | During data loading, you can specify the CSV path as **/2016/xyz.csv**. | + +----------------------------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.badRecords.location | ``-`` | Storage path of bad records. This path is an HDFS path. The default value is **Null**. If bad records logging or bad records operation redirection is enabled, the path must be configured by the user. | + +----------------------------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.bad.records.action | fail | The following are four types of actions for bad records: | + | | | | + | | | **FORCE**: Data is automatically corrected by storing the bad records as NULL. | + | | | | + | | | **REDIRECT**: Bad records are written to the raw CSV instead of being loaded. | + | | | | + | | | **IGNORE**: Bad records are neither loaded nor written to the raw CSV. | + | | | | + | | | **FAIL**: Data loading fails if any bad records are found. | + +----------------------------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.update.sync.folder | /tmp/carbondata | Specifies the **modifiedTime.mdt** file path. You can set it to an existing path or a new path. | + | | | | + | | | .. note:: | + | | | | + | | | If you set this parameter to an existing path, ensure that all users can access the path and the path has the 777 permission. | + +----------------------------+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _mrs_01_1404__t197b7a04db3c4f919bd30707c2fdcd1f: + +.. 
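+For reference, the system settings described in Table 1 are plain key-value entries in **carbon.properties**. A fragment combining them might look as follows; the values shown are examples only, not defaults:
+
+.. code-block::
+
+   carbon.ddl.base.hdfs.url=/data/cnbc
+   carbon.badRecords.location=/tmp/carbon/badrecords
+   carbon.bad.records.action=REDIRECT
+   carbon.update.sync.folder=/tmp/carbondata
+
+Note that **carbon.badRecords.location** must be set when bad records logging or the **REDIRECT** action is enabled, as described in the table above.
+
+..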
table:: **Table 2** Performance configurations in **carbon.properties** + + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==================================================+=======================+===================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | **Data Loading Configuration** | | | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.sort.file.write.buffer.size | 16384 | CarbonData sorts data and writes it to a temporary file to limit memory usage. This parameter controls the size of the buffer used for reading and writing temporary files. The unit is bytes. | + | | | | + | | | The value ranges from 10240 to 10485760. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.graph.rowset.size | 100,000 | Rowset size exchanged in data loading graph steps. | + | | | | + | | | The value ranges from 500 to 1,000,000. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.number.of.cores.while.loading | 6 | Number of cores used during data loading. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.sort.size | 500000 | Number of records to be sorted | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.enableXXHash | true | Hashmap algorithm used for hashkey calculation | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.number.of.cores.block.sort | 7 | Number of cores used for sorting blocks during data loading | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.max.driver.lru.cache.size | -1 | Maximum size of LRU caching for data loading at the driver side. The unit is MB. The default value is **-1**, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.max.executor.lru.cache.size | -1 | Maximum size of LRU caching for data loading at the executor side. The unit is MB. The default value is **-1**, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted. If this parameter is not configured, the value of **carbon.max.driver.lru.cache.size** is used. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.merge.sort.prefetch | true | Whether to enable prefetch of data during merge sort while reading data from sorted temp files in the process of data loading | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.update.persist.enable | true | Configuration to enable the dataset of RDD/dataframe to persist data. Enabling this will reduce the execution time of UPDATE operation. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.unsafe.sort | true | Whether to use unsafe sort during data loading. Unsafe sort reduces the garbage collection during data load operation, resulting in better performance. The default value is **true**, indicating that unsafe sort is enabled. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.offheap.sort | true | Whether to use off-heap memory for sorting of data during data loading | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | offheap.sort.chunk.size.inmb | 64 | Size of data chunks to be sorted, in MB. The value ranges from 1 to 1024. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.unsafe.working.memory.in.mb | 512 | Size of the unsafe working memory. This will be used for sorting data and storing column pages. The unit is MB. | + | | | | + | | | Memory required for data loading: | + | | | | + | | | carbon.number.of.cores.while.loading [default value is 6] x Number of tables to load in parallel x offheap.sort.chunk.size.inmb [default value is 64 MB] + carbon.blockletgroup.size.in.mb [default value is 64 MB] + Current compaction ratio [64 MB/3.5]) | + | | | | + | | | = Around 900 MB per table | + | | | | + | | | Memory required for data query: | + | | | | + | | | (SPARK_EXECUTOR_INSTANCES. [default value is 2] x (carbon.blockletgroup.size.in.mb [default value: 64 MB] + carbon.blockletgroup.size.in.mb [default value = 64 MB x 3.5) x Number of cores per executor [default value: 1]) | + | | | | + | | | = ~ 600 MB | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.sort.inmemory.storage.size.in.mb | 512 | Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | sort.inmemory.size.inmb | 1024 | Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB. | + | | | | + | | | If **carbon.unsafe.working.memory.in.mb** and **carbon.sort.inmemory.storage.size.in.mb** are configured, you do not need to set this parameter. If this parameter has been configured, 20% of the memory is used for working memory **carbon.unsafe.working.memory.in.mb**, and 80% is used for sort storage memory **carbon.sort.inmemory.storage.size.in.mb**. | + | | | | + | | | .. note:: | + | | | | + | | | The value of **spark.yarn.executor.memoryOverhead** configured for Spark must be greater than the value of **sort.inmemory.size.inmb** configured for CarbonData. Otherwise, Yarn might stop the executor if off-heap access exceeds the configured executor memory. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.blockletgroup.size.in.mb | 64 | The data is read as a group of blocklets which are called blocklet groups. This parameter specifies the size of each blocklet group. Higher value results in better sequential I/O access. | + | | | | + | | | The minimum value is 16 MB. Any value less than 16 MB will be reset to the default value (64 MB). | + | | | | + | | | The unit is MB. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.inmemory.merge.sort | false | Whether to enable **inmemorymerge sort**. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | use.offheap.in.query.processing | true | Whether to enable **offheap** in query processing. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.load.sort.scope | local_sort | Sort scope for the load operation. There are two types of sort: **batch_sort** and **local_sort**. If **batch_sort** is selected, the loading performance is improved but the query performance is reduced. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.batch.sort.size.inmb | ``-`` | Size of data to be considered for batch sorting during data loading. The recommended value is less than 45% of the total sort data. The unit is MB. | + | | | | + | | | .. note:: | + | | | | + | | | If this parameter is not set, its value is about 45% of the value of **sort.inmemory.size.inmb** by default. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.unsafe.columnpage | true | Whether to keep page data in heap memory during data loading or query to prevent garbage collection bottleneck. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.use.local.dir | false | Whether to use Yarn local directories for multi-disk data loading. If this parameter is set to **true**, Yarn local directories are used to load multi-disk data to improve data loading performance. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.use.multiple.temp.dir | false | Whether to use multiple temporary directories for storing temporary files to improve data loading performance. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.load.datamaps.parallel.db_name.table_name | N/A | The value can be **true** or **false**. You can set the database name and table name to improve the first query performance of the table. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Compaction Configuration** | | | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.number.of.cores.while.compacting | 2 | Number of cores to be used while compacting data. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.compaction.level.threshold | 4,3 | This configuration is for minor compaction which decides how many segments to be merged. | + | | | | + | | | For example, if this parameter is set to **2,3**, minor compaction is triggered every two segments. **3** is the number of level 1 compacted segments which is further compacted to new segment. | + | | | | + | | | The value ranges from 0 to 100. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.major.compaction.size | 1024 | Major compaction size. Sum of the segments which is below this threshold will be merged. | + | | | | + | | | The unit is MB. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.horizontal.compaction.enable | true | Whether to enable/disable horizontal compaction. After every DELETE and UPDATE statement, horizontal compaction may occur in case the incremental (DELETE/ UPDATE) files becomes more than specified threshold. By default, this parameter is set to **true**. You can set this parameter to **false** to disable horizontal compaction. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.horizontal.update.compaction.threshold | 1 | Threshold limit on number of UPDATE delta files within a segment. In case the number of delta files goes beyond the threshold, the UPDATE delta files within the segment becomes eligible for horizontal compaction and are compacted into single UPDATE delta file. By default, this parameter is set to **1**. The value ranges from **1** to **10000**. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.horizontal.delete.compaction.threshold | 1 | Threshold limit on number of DELETE incremental files within a block of a segment. In case the number of incremental files goes beyond the threshold, the DELETE incremental files for the particular block of the segment becomes eligible for horizontal compaction and are compacted into single DELETE incremental file. By default, this parameter is set to **1**. The value ranges from **1** to **10000**. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Query Configuration** | | | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.number.of.cores | 4 | Number of cores to be used during query | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.limit.block.distribution.enable | false | Whether to enable the CarbonData distribution for limit query. The default value is **false**, indicating that block distribution is disabled for query statements that contain the keyword limit. For details about how to optimize this parameter, see :ref:`Configurations for Performance Tuning `. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.custom.block.distribution | false | Whether to enable Spark or CarbonData block distribution. By default, the value is **false**, indicating that Spark block distribution is enabled. To enable CarbonData block distribution, change the value to **true**. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.infilter.subquery.pushdown.enable | false | If this is set to **true** and a Select query is triggered in the filter with subquery, the subquery is executed and the output is broadcast as IN filter to the left table. Otherwise, SortMergeSemiJoin is executed. You are advised to set this to **true** when IN filter subquery does not return too many records. For example, when the IN sub-sentence query returns 10,000 or fewer records, enabling this parameter will give the query results faster. | + | | | | + | | | Example: **select \* from flow_carbon_256b where cus_no in (select cus_no from flow_carbon_256b where dt>='20260101' and dt<='20260701' and txn_bk='tk_1' and txn_br='tr_1') limit 1000;** | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.scheduler.minRegisteredResourcesRatio | 0.8 | Minimum resource (executor) ratio needed for starting the block distribution. The default value is **0.8**, indicating that 80% of the requested resources are allocated for starting block distribution. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.dynamicAllocation.schedulerTimeout | 5 | Maximum time that the scheduler waits for executors to be active. The default value is **5** seconds, and the maximum value is **15** seconds. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable.unsafe.in.query.processing | true | Whether to use unsafe sort during query. Unsafe sort reduces the garbage collection during query, resulting in better performance. The default value is **true**, indicating that unsafe sort is enabled. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.enable.vector.reader | true | Whether to enable vector processing for result collection to improve query performance | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.query.show.datamaps | true | **SHOW TABLES** lists all tables including the primary table and datamaps. To filter out the datamaps, set this parameter to **false**. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Secondary Index Configuration** | | | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.secondary.index.creation.threads | 1 | Number of threads to concurrently process segments during secondary index creation. This property helps fine-tuning the system when there are a lot of segments in a table. The value ranges from 1 to 50. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.si.lookup.partialstring | true | - When the parameter value is **true**, it includes indexes started with, ended with, and contained. | + | | | - When the parameter value is **false**, it includes only secondary indexes started with. 
| + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.si.segment.merge | true | Enabling this property merges **.carbondata** files inside the secondary index segment. The merging will happen after the load operation. That is, at the end of the secondary index table load, small files are checked and merged. | + | | | | + | | | .. note:: | + | | | | + | | | Table Block Size is used as the size threshold for merging small files. | + +--------------------------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. table:: **Table 3** Other configurations in **carbon.properties** + + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==========================================+==========================+===========================================================================================================================================================================================================================================================================================+ + | **Data Loading Configuration** | | | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.lock.type | HDFSLOCK | Type of lock to be acquired during concurrent operations on a table. | + | | | | + | | | There are following types of lock implementation: | + | | | | + | | | - **LOCALLOCK**: Lock is created on local file system as a file. This lock is useful when only one Spark driver (or JDBCServer) runs on a machine. | + | | | - **HDFSLOCK**: Lock is created on HDFS file system as a file. This lock is useful when multiple Spark applications are running and no ZooKeeper is running on a cluster. 
| + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.sort.intermediate.files.limit | 20 | Minimum number of intermediate files. After intermediate files are generated, sort and merge the files. For details about how to optimize this parameter, see :ref:`Configurations for Performance Tuning `. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.csv.read.buffersize.byte | 1048576 | Size of CSV reading buffer | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.merge.sort.reader.thread | 3 | Maximum number of threads used for reading intermediate files for final merging. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.concurrent.lock.retries | 100 | Maximum number of retries used to obtain the concurrent operation lock. This parameter is used for concurrent loading. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.concurrent.lock.retry.timeout.sec | 1 | Interval between the retries to obtain the lock for concurrent operations. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.lock.retries | 3 | Maximum number of retries to obtain the lock for any operations other than import. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.lock.retry.timeout.sec | 5 | Interval between the retries to obtain the lock for any operation other than import. 
| + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.tempstore.location | /opt/Carbon/TempStoreLoc | Temporary storage location. By default, the **System.getProperty("java.io.tmpdir")** method is used to obtain the value. For details about how to optimize this parameter, see the description of **carbon.use.local.dir** in :ref:`Configurations for Performance Tuning `. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.load.log.counter | 500000 | Data loading records count in logs | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | SERIALIZATION_NULL_FORMAT | \\N | Value to be replaced with NULL | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.skip.empty.line | false | Setting this property will ignore the empty lines in the CSV file during data loading. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.load.datamaps.parallel | false | Whether to enable parallel datamap loading for all tables in all sessions. This property will improve the time to load datamaps into memory by distributing the job among executors, thus improving query performance. 
| + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Merging Configuration** | | | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.numberof.preserve.segments | 0 | If you want to preserve some number of segments from being compacted, then you can set this configuration. | + | | | | + | | | For example, if **carbon.numberof.preserve.segments** is set to **2**, the latest two segments will always be excluded from the compaction. | + | | | | + | | | No segments will be preserved by default. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.allowed.compaction.days | 0 | This configuration is used to control on the number of recent segments that needs to be merged. | + | | | | + | | | For example, if this parameter is set to **2**, the segments which are loaded in the time frame of past 2 days only will get merged. Segments which are loaded earlier than 2 days will not be merged. | + | | | | + | | | This configuration is disabled by default. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.enable.auto.load.merge | false | Whether to enable compaction along with data loading. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.merge.index.in.segment | true | This configuration enables to merge all the CarbonIndex files (**.carbonindex**) into a single MergeIndex file (**.carbonindexmerge**) upon data loading completion. This significantly reduces the delay in serving the first query. 
| + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **Query Configuration** | | | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | max.query.execution.time | 60 | Maximum time allowed for one query to be executed. | + | | | | + | | | The unit is minute. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.enableMinMax | true | MinMax is used to improve query performance. You can set this to **false** to disable this function. | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.lease.recovery.retry.count | 5 | Maximum number of attempts that need to be made for recovering a lease on a file. | + | | | | + | | | Minimum value: **1** | + | | | | + | | | Maximum value: **50** | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | carbon.lease.recovery.retry.interval | 1000 (ms) | Interval or pause time after a lease recovery attempt is made on a file. | + | | | | + | | | Minimum value: **1000** (ms) | + | | | | + | | | Maximum value: **10000** (ms) | + +------------------------------------------+--------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. 
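+
+All of the parameters above are plain key=value entries in the **carbon.properties** file (the path referenced by **-Dcarbon.properties.filepath**, for example **/opt/client/Spark2x/spark/conf/carbon.properties** on a client installed in **/opt/client**). The following snippet is only an illustrative sketch: it restates a few of the default values documented in the preceding tables, and the entries you actually need depend on your cluster and workload.
+
+.. code-block::
+
+   # Data loading
+   carbon.sort.intermediate.files.limit=20
+   carbon.csv.read.buffersize.byte=1048576
+   carbon.merge.sort.reader.thread=3
+   carbon.lock.retries=3
+   carbon.lock.retry.timeout.sec=5
+
+   # Merging
+   carbon.enable.auto.load.merge=false
+   carbon.merge.index.in.segment=true
+
+   # Query
+   max.query.execution.time=60
+
+Any parameter that is not explicitly set in **carbon.properties** keeps the default value listed above.
+
+..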
table:: **Table 4** Spark configuration reference in **spark-defaults.conf** + + +-----------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=============================+=======================+=====================================================================================================================================================================================================================================================+ + | spark.driver.memory | 4G | Memory to be used for the driver process. SparkContext has been initialized. | + | | | | + | | | .. note:: | + | | | | + | | | In client mode, do not use SparkConf to set this parameter in the application because the driver JVM has been started. To configure this parameter, configure it in the **--driver-memory** command-line option or in the default property file. | + +-----------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.executor.memory | 4 GB | Memory to be used for each executor process. | + +-----------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.crossJoin.enabled | true | If the query contains a cross join, enable this property so that no error is thrown. In this case, you can use a cross join instead of a join for better performance. | + +-----------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Configure the following parameters in the **spark-defaults.conf** file on the Spark driver. + +- In spark-sql mode: + + .. _mrs_01_1404__ta902cd071dfb426097416a5c7034ee6c: + + .. 
table:: **Table 5** Parameter description + + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Value | Description | + +========================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+=============================================================================================================================================================================================================================================================+ + | spark.driver.extraJavaOptions | -Dlog4j.configuration=file:/opt/client/Spark2x/spark/conf/log4j.properties -Djetty.version=x.y.z -Dzookeeper.server.principal=zookeeper/hadoop.\ ** -Djava.security.krb5.conf=/opt/client/KrbClient/kerberos/var/krb5kdc/krb5.conf -Djava.security.auth.login.config=/opt/client/Spark2x/spark/conf/jaas.conf -Dorg.xerial.snappy.tempdir=/opt/client/Spark2x/tmp -Dcarbon.properties.filepath=/opt/client/Spark2x/spark/conf/carbon.properties -Djava.io.tmpdir=/opt/client/Spark2x/tmp | The default value **/opt/client/Spark2x/spark** indicates **CLIENT_HOME** of the client and is added to the end of the value of **spark.driver.extraJavaOptions**. This parameter is used to specify the path of the **carbon.properties**\ file in Driver. | + | | | | + | | | .. note:: | + | | | | + | | | Spaces next to equal marks (=) are not allowed. | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.session.state.builder | org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder | Session state constructor. 
| + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.carbon.sqlastbuilder.classname | org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder | AST constructor. | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.catalog.class | org.apache.spark.sql.hive.HiveACLExternalCatalog | Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled. | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.hive.implementation | org.apache.spark.sql.hive.HiveACLClientImpl | How to call the Hive client. This parameter is mandatory if Spark ACL is enabled. 
| + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.hiveClient.isolation.enabled | false | This parameter is mandatory if Spark ACL is enabled. | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- In JDBCServer mode: + + .. _mrs_01_1404__t3897ae14f205433fb0f98b79411cfa0c: + + .. 
table:: **Table 6** Parameter description + + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Value | Description | + +========================================+===========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+========================================================================================================================================================================================================================================================+ + | spark.driver.extraJavaOptions | -Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer} | The default value **${SPARK_CONF_DIR}** depends on a specific cluster and is added to the end of the value of the **spark.driver.extraJavaOptions** parameter. This parameter is used to specify the path of the **carbon.properties** file in Driver. | + | | | | + | | | .. note:: | + | | | | + | | | Spaces next to equal marks (=) are not allowed. 
| + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.session.state.builder | org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder | Session state constructor. | + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.carbon.sqlastbuilder.classname | org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder | AST constructor. | + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.catalog.class | org.apache.spark.sql.hive.HiveACLExternalCatalog | Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled. 
| + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.hive.implementation | org.apache.spark.sql.hive.HiveACLClientImpl | How to call the Hive client. This parameter is mandatory if Spark ACL is enabled. | + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spark.sql.hiveClient.isolation.enabled | false | This parameter is mandatory if Spark ACL is enabled. | + +----------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/index.rst new file mode 100644 index 0000000..e677726 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/index.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1400.html + +.. 
_mrs_01_1400: + +Using CarbonData (for MRS 3.x or Later) +======================================= + +- :ref:`Overview ` +- :ref:`Configuration Reference ` +- :ref:`CarbonData Operation Guide ` +- :ref:`CarbonData Performance Tuning ` +- :ref:`CarbonData Access Control ` +- :ref:`CarbonData Syntax Reference ` +- :ref:`CarbonData Troubleshooting ` +- :ref:`CarbonData FAQ ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview/index + configuration_reference + carbondata_operation_guide/index + carbondata_performance_tuning/index + carbondata_access_control + carbondata_syntax_reference/index + carbondata_troubleshooting/index + carbondata_faq/index diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/carbondata_overview.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/carbondata_overview.rst new file mode 100644 index 0000000..b71e539 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/carbondata_overview.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_1402.html + +.. _mrs_01_1402: + +CarbonData Overview +=================== + +CarbonData is a new Apache Hadoop native data-store format. CarbonData allows faster interactive queries over petabytes of data by using advanced columnar storage, index, compression, and encoding techniques to improve computing efficiency. In addition, CarbonData is also a high-performance analysis engine that integrates data sources with Spark. + + +.. figure:: /_static/images/en-us_image_0000001295930452.png + :alt: **Figure 1** Basic architecture of CarbonData + + **Figure 1** Basic architecture of CarbonData + +The purpose of using CarbonData is to provide quick responses to ad hoc queries on big data. Essentially, CarbonData is an Online Analytical Processing (OLAP) engine, which stores data in tables similar to those in a relational database management system (RDBMS). You can import more than 10 TB of data into tables created in CarbonData format, and CarbonData automatically organizes and stores the data using compressed multi-dimensional indexes. After data is loaded to CarbonData, CarbonData responds to ad hoc queries in seconds. + +CarbonData integrates data sources into the Spark ecosystem, and you can query and analyze the data using Spark SQL. You can also use the JDBCServer tool provided by Spark to connect to Spark SQL. + +Topology of CarbonData +---------------------- + +CarbonData runs as a data source inside Spark. Therefore, CarbonData does not start any additional processes on cluster nodes. The CarbonData engine runs inside the Spark executor. + + +.. figure:: /_static/images/en-us_image_0000001348770313.png + :alt: **Figure 2** Topology of CarbonData + + **Figure 2** Topology of CarbonData + +Data stored in a CarbonData table is divided into several CarbonData data files. Each time data is queried, the CarbonData engine reads and filters the data sets. The CarbonData engine runs as part of the Spark executor process and is responsible for handling a subset of data file blocks. + +Table data is stored in HDFS. Nodes in the same Spark cluster can be used as HDFS data nodes. + +CarbonData Features +------------------- + +- SQL: CarbonData is compatible with Spark SQL and supports SQL query operations performed on Spark SQL. +- Simple Table dataset definition: CarbonData allows you to define and create datasets by using user-friendly Data Definition Language (DDL) statements. 
CarbonData DDL is flexible and easy to use, and can define complex tables. +- Easy data management: CarbonData provides various data management functions for data loading and maintenance. CarbonData supports bulk loading of historical data and incremental loading of new data. Loaded data can be deleted based on load time, and a specific loading operation can be undone. +- The CarbonData file format is a columnar store in HDFS. This format has many new column-based file storage features, such as table splitting and data compression. CarbonData has the following characteristics: + + - Stores data along with indexes: This significantly accelerates query performance and reduces I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels of indexes. A processing framework can leverage these indexes to reduce the number of tasks that need to be scheduled and processed, and it can also perform skip scans at a finer granularity (called blocklets) during task-side scanning instead of scanning the whole file. + - Operable encoded data: By supporting efficient compression, CarbonData can run queries directly on compressed and encoded data. The data is converted only just before the results are returned to users, which is called late materialization. + - Support for various use cases with a single data format: for example, interactive OLAP-style queries, sequential access (big scans), and random access (narrow scans). + +Key Technologies and Advantages of CarbonData +--------------------------------------------- + +- Quick query response: CarbonData features high-performance queries. The query speed of CarbonData is 10 times that of Spark SQL. It uses dedicated data formats and applies multiple index technologies and push-down optimizations, providing quick responses to TB-level data queries. +- Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight compression algorithms. This reduces data storage space by 60% to 80% and significantly lowers hardware storage costs. diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/index.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/index.rst new file mode 100644 index 0000000..776c496 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1401.html + +.. _mrs_01_1401: + +Overview +======== + +This section applies to MRS 3.x or later. For versions earlier than MRS 3.x, see :ref:`Using CarbonData (for Versions Earlier Than MRS 3.x) `. + +- :ref:`CarbonData Overview ` +- :ref:`Main Specifications of CarbonData ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + carbondata_overview + main_specifications_of_carbondata diff --git a/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/main_specifications_of_carbondata.rst b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/main_specifications_of_carbondata.rst new file mode 100644 index 0000000..53e0663 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_mrs_3.x_or_later/overview/main_specifications_of_carbondata.rst @@ -0,0 +1,72 @@ +:original_name: mrs_01_1403.html + +.. _mrs_01_1403: + +Main Specifications of CarbonData +================================= + + +Main Specifications of CarbonData +--------------------------------- + +.. 
table:: **Table 1** Main Specifications of CarbonData + + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + | Entity | Tested Value | Test Environment | + +====================================+========================================================================+=====================================================================================================+ + | Number of tables | 10000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. | + | | | | + | | | Total columns: 107 | + | | | | + | | | String: 75 | + | | | | + | | | Int: 13 | + | | | | + | | | BigInt: 7 | + | | | | + | | | Timestamp: 6 | + | | | | + | | | Double: 6 | + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + | Number of table columns | 2000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. | + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + | Maximum size of a raw CSV file | 200 GB | 17 cluster nodes. 150 GB memory and 25 vCPUs for each executor. Driver memory: 10 GB, 17 executors. | + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + | Number of CSV files in each folder | 100 folders. Each folder has 10 files. The size of each file is 50 MB. | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. | + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + | Number of load folders | 10000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. | + +------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+ + +The memory required for data loading depends on the following factors: + +- Number of columns +- Column values +- Concurrency (configured using **carbon.number.of.cores.while.loading**) +- Sort size in memory (configured using **carbon.sort.size**) +- Intermediate cache (configured using **carbon.graph.rowset.size**) + +Data loading of an 8 GB CSV file that contains 10 million records and 300 columns with each row size being about 0.8 KB requires about 10 GB executor memory. That is, set **carbon.sort.size** to **100000** and retain the default values for other parameters. + +Table Specifications +-------------------- + +.. 
table:: **Table 2** Table specifications + + +-----------------------------------------------------------------------------------------------------------+--------------+ + | Entity | Tested Value | + +===========================================================================================================+==============+ + | Number of secondary index tables | 10 | + +-----------------------------------------------------------------------------------------------------------+--------------+ + | Number of composite columns in a secondary index table | 5 | + +-----------------------------------------------------------------------------------------------------------+--------------+ + | Length of column name in a secondary index table (unit: character) | 120 | + +-----------------------------------------------------------------------------------------------------------+--------------+ + | Length of a secondary index table name (unit: character) | 120 | + +-----------------------------------------------------------------------------------------------------------+--------------+ + | Cumulative length of all secondary index table names + column names in an index table\* (unit: character) | 3800*\* | + +-----------------------------------------------------------------------------------------------------------+--------------+ + +.. note:: + + - \* Characters of column names in an index table refer to the upper limit allowed by Hive or the upper limit of available resources. + - \*\* Secondary index tables are registered using Hive and stored in HiveSERDEPROPERTIES in JSON format. The value of **SERDEPROPERTIES** supported by Hive can contain a maximum of 4,000 characters and cannot be changed. diff --git a/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/about_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/about_carbondata_table.rst new file mode 100644 index 0000000..49d7f64 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/about_carbondata_table.rst @@ -0,0 +1,57 @@ +:original_name: mrs_01_0387.html + +.. _mrs_01_0387: + +About CarbonData Table +====================== + +Description +----------- + +CarbonData tables are similar to tables in the relational database management system (RDBMS). RDBMS tables consist of rows and columns to store data. CarbonData tables have fixed columns and also store structured data. In CarbonData, data is saved in entity files. + +Supported Data Types +-------------------- + +CarbonData tables support the following data types: + +- Int +- String +- BigInt +- Decimal +- Double +- TimeStamp + +:ref:`Table 1 ` describes the details about each data type. + +.. _mrs_01_0387__te1b84a2aca034e4b9e5745ab6b7bb9fd: + +.. table:: **Table 1** CarbonData data types + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Type | Description | + +===================================+======================================================================================================================================================================+ + | Int | 4-byte signed integer ranging from -2,147,483,648 to 2,147,483,647 | + | | | + | | .. note:: | + | | | + | | If a non-dictionary column is of the **int** data type, it is internally stored as the **BigInt** type. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | String | The maximum character string length is 100000. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | BigInt | Data is saved using the 64-bit technology. The value ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Decimal | The default value is (10,0) and maximum value is (38,38). | + | | | + | | .. note:: | + | | | + | | When query with filters, append **BD** to the number to achieve accurate results. For example, **select \* from carbon_table where num = 1234567890123456.22BD**. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Double | Data is saved using the 64-bit technology. The value ranges from 4.9E-324 to 1.7976931348623157E308. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | TimeStamp | **yyyy-MM-dd HH:mm:ss** format is used by default. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. note:: + + Measurement of all Integer data is processed and displayed using the **BigInt** data type. diff --git a/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/creating_a_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/creating_a_carbondata_table.rst new file mode 100644 index 0000000..0252a05 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/creating_a_carbondata_table.rst @@ -0,0 +1,83 @@ +:original_name: mrs_01_0388.html + +.. _mrs_01_0388: + +Creating a CarbonData Table +=========================== + +Scenario +-------- + +A CarbonData table must be created to load and query data. + +Creating a Table with Self-Defined Columns +------------------------------------------ + +Users can create a table by specifying its columns and data types. For analysis clusters with Kerberos authentication enabled, if a user wants to create a CarbonData table in a database other than the **default** database, the **Create** permission of the database must be added to the role to which the user is bound in Hive role management. 
+ +Sample command: + +**CREATE TABLE** **IF NOT EXISTS productdb.productSalesTable (** + +**productNumber Int,** + +**productName String,** + +**storeCity String,** + +**storeProvince String,** + +**revenue Int)** + +**STORED BY** *'*\ **org.apache.carbondata.format'** + +**TBLPROPERTIES (** + +**'table_blocksize'='128',** + +**'DICTIONARY_EXCLUDE'='productName',** + +**'DICTIONARY_INCLUDE'='productNumber');** + +The following table describes parameters of preceding commands. + +.. table:: **Table 1** Parameter description + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================================================================================================================================================================================================+ + | productSalesTable | Table name. The table is used to load data for analysis. | + | | | + | | The table name consists of letters, digits, and underscores (_). | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | productdb | Database name. The database maintains logical connections with tables stored in it to identify and manage the tables. | + | | | + | | The database name consists of letters, digits, and underscores (_). | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | productNumber | Columns in the table. The columns are service entities for data analysis. | + | | | + | productName | The column name (field name) consists of letters, digits, and underscores (_). | + | | | + | storeCity | .. note:: | + | | | + | storeProvince | In CarbonData, you cannot configure a column's NOT NULL or default value, or the primary key of the table. | + | | | + | revenue | | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table_blocksize | Block size of data files used by the CarbonData table. The value ranges from 1 MB to 2048 MB. The default is 1024 MB. 
| + | | | + | | - If the value of **table_blocksize** is too small, a large number of small files will be generated when data is loaded. This may affect the performance in using HDFS. | + | | - If the value of **table_blocksize** is too large, a large volume of data must be read from a block and the read concurrency is low when data is queried. As a result, the query performance deteriorates. | + | | | + | | You are advised to set the block size based on the data volume. For example, set the block size to 256 MB for GB-level data, 512 MB for TB-level data, and 1024 MB for PB-level data. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DICTIONARY_EXCLUDE | Specifies the columns that do not generate dictionaries. This function is optional and applicable to columns of high complexity. By default, the system generates dictionaries for columns of the String type. However, as the number of values in the dictionaries increases, conversion operations by the dictionaries increase and the system performance deteriorates. | + | | | + | | Generally, if a column has over 50,000 unique data records, it is considered as a highly complex column and dictionary generation must be disabled. | + | | | + | | .. note:: | + | | | + | | Non-dictionary columns support only the String and Timestamp data types. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DICTIONARY_INCLUDE | Specifies the columns that generate dictionaries. This function is optional and applicable to columns of low complexity. It improves the performance of queries with the **groupby** condition. Generally, the complexity of a dictionary column cannot exceed 50,000. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/deleting_a_carbondata_table.rst b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/deleting_a_carbondata_table.rst new file mode 100644 index 0000000..bb4a063 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/deleting_a_carbondata_table.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0389.html + +.. _mrs_01_0389: + +Deleting a CarbonData Table +=========================== + +Scenario +-------- + +Unused CarbonData tables can be deleted. After a CarbonData table is deleted, its metadata and loaded data are deleted together. + +Procedure +--------- + +#. 
Run the following command to delete a CarbonData table: + + **DROP TABLE [IF EXISTS] [db_name.]table_name;** + + **db_name** is optional. If **db_name** is not specified, the table named **table_name** in the current database is deleted. + + For example, run the following command to delete the **productSalesTable** table in the **productdb** database: + + **DROP TABLE productdb.productSalesTable;** + +#. Run the following command to confirm that the table is deleted: + + **SHOW TABLES;** diff --git a/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/index.rst b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/index.rst new file mode 100644 index 0000000..7125075 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_0385.html + +.. _mrs_01_0385: + +Using CarbonData (for Versions Earlier Than MRS 3.x) +==================================================== + +- :ref:`Using CarbonData from Scratch ` +- :ref:`About CarbonData Table ` +- :ref:`Creating a CarbonData Table ` +- :ref:`Deleting a CarbonData Table ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_carbondata_from_scratch + about_carbondata_table + creating_a_carbondata_table + deleting_a_carbondata_table diff --git a/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/using_carbondata_from_scratch.rst b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/using_carbondata_from_scratch.rst new file mode 100644 index 0000000..804be62 --- /dev/null +++ b/doc/component-operation-guide/source/using_carbondata_for_versions_earlier_than_mrs_3.x/using_carbondata_from_scratch.rst @@ -0,0 +1,134 @@ +:original_name: mrs_01_0386.html + +.. _mrs_01_0386: + +Using CarbonData from Scratch +============================= + +This section is for MRS 3.x or earlier. For MRS 3.x or later, see :ref:`Using CarbonData (for MRS 3.x or Later) `. + +This section describes the procedure of using Spark CarbonData. All tasks are based on the Spark-beeline environment. The tasks include: + +#. Connecting to Spark + + Before performing any operation on CarbonData, users must connect CarbonData to Spark. + +#. Creating a CarbonData table + + After connecting to Spark, users must create a CarbonData table to load and query data. + +#. Loading data to the CarbonData table + + Users load data from CSV files in HDFS to the CarbonData table. + +#. Querying data from the CarbonData table + + After data is loaded to the CarbonData table, users can run query commands such as **groupby** and **where**. + +Prerequisites +------------- + +A client has been installed. For details, see :ref:`Using an MRS Client `. + +Procedure +--------- + +#. Connect CarbonData to Spark. + + a. Prepare a client based on service requirements and use user **root** to log in to the node where the client is installed. + + For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see :ref:`Using an MRS Client `. + + b. Run the following commands to switch the user and configure environment variables: + + **sudo su - omm** + + **source /opt/client/bigdata_env** + + c. For clusters with Kerberos authentication enabled, run the following command to authenticate the user. For clusters with Kerberos authentication disabled, skip this step. + + **kinit** **Spark username** + + .. 
note:: + + The user needs to be added to user groups **hadoop** (primary group) and **hive**. + + d. Run the following command to connect to the Spark environment. + + **spark-beeline** + +#. Create a CarbonData table. + + Run the following command to create a CarbonData table, which is used to load and query data. + + **CREATE TABLE x1 (imei string, deviceInformationId int, mac string, productdate timestamp, updatetime timestamp, gamePointId double, contractNumber double)** + + **STORED BY 'org.apache.carbondata.format'** + + **TBLPROPERTIES ('DICTIONARY_EXCLUDE'='mac','DICTIONARY_INCLUDE'='deviceInformationId');** + + The command output is as follows: + + .. code-block:: + + +---------+--+ + | result | + +---------+--+ + +---------+--+ + No rows selected (1.551 seconds) + +#. Load data from CSV files to the CarbonData table. + + Run the command to load data from CSV files based on the required parameters. Only CSV files are supported. The CSV column name and sequence configured in the **LOAD** command must be consistent with those in the CarbonData table. The data formats and number of data columns in the CSV files must also be the same as those in the CarbonData table. + + The CSV files must be stored on HDFS. You can upload the files to OBS and import them from OBS to HDFS on the **Files** page of the MRS console. + + If Kerberos authentication is enabled, prepare the CSV files in the work environment and import them to HDFS using open-source HDFS commands. In addition, assign the Spark user with the read and execute permissions of the files on HDFS by referring to :ref:`5 `. + + For example, the **data.csv** file is saved in the **tmp** directory of HDFS with the following contents: + + .. code-block:: + + x123,111,dd,2017-04-20 08:51:27,2017-04-20 07:56:51,2222,33333 + + The command for loading data from that file is as follows: + + **LOAD DATA inpath 'hdfs://hacluster/tmp/data.csv' into table x1 options('DELIMITER'=',','QUOTECHAR'='"','FILEHEADER'='imei, deviceinformationid,mac,productdate,updatetime,gamepointid,contractnumber');** + + The command output is as follows: + + .. code-block:: + + +---------+--+ + | Result | + +---------+--+ + +---------+--+ + No rows selected (3.039 seconds) + +#. Query data from the CarbonData. + + - **Obtaining the number of records** + + Run the following command to obtain the number of records in the CarbonData table: + + **select count(*) from x1;** + + - **Querying with the groupby condition** + + Run the following command to obtain the **deviceinformationid** records without repetition in the CarbonData table: + + **select deviceinformationid,count (distinct deviceinformationid) from x1 group by deviceinformationid;** + + - **Querying with the where condition** + + Run the following command to obtain specific **deviceinformationid** records: + + **select \* from x1 where deviceinformationid='111';** + + .. note:: + + If the query result has non-English characters, the columns in the query result may not be aligned. This is because characters of different languages occupy different widths. + +#. Run the following command to exit the Spark environment. 
+ + **!quit** diff --git a/doc/component-operation-guide/source/using_clickhouse/backing_up_and_restoring_clickhouse_data_using_a_data_file.rst b/doc/component-operation-guide/source/using_clickhouse/backing_up_and_restoring_clickhouse_data_using_a_data_file.rst new file mode 100644 index 0000000..872947c --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/backing_up_and_restoring_clickhouse_data_using_a_data_file.rst @@ -0,0 +1,106 @@ +:original_name: mrs_01_24292.html + +.. _mrs_01_24292: + +Backing Up and Restoring ClickHouse Data Using a Data File +========================================================== + +Scenario +-------- + +This section describes how to back up data by exporting ClickHouse data to a CSV file and restore data using the CSV file. + +Prerequisites +------------- + +- You have installed the ClickHouse client. +- You have created a user with related permissions on ClickHouse tables on Manager. +- You have prepared a server for backup. + +Backing Up Data +--------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create ClickHouse tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + a. Run the following command if it is an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + b. **kinit** *Component service user* + + Example: **kinit clickhouseuser** + +#. Run the ClickHouse client command to export the ClickHouse table data to be backed up to a specified directory. + + **clickhouse client --host** *Host name/Instance IP address* **--secure --port 9440 --query=**"*Table query statement*" > *Path of the exported CSV file* + + The following shows an example of backing up data in the **test** table to the **default_test.csv** file on the ClickHouse instance **10.244.225.167**. + + **clickhouse client --host 10.244.225.167 --secure --port 9440 --query="select \* from default.test FORMAT CSV" > /opt/clickhouse/default_test.csv** + +#. Upload the exported CSV file to the backup server. + +Restoring Data +-------------- + +#. Upload the backup data file on the backup server to the directory where the ClickHouse client is located. + + For example, upload the **default_test.csv** backup file to the **/opt/clickhouse** directory. + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create ClickHouse tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + a. Run the following command if it is an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + b. **kinit** *Component service user* + + Example: **kinit clickhouseuser** + +#. Run the ClickHouse client command to log in to the ClickHouse cluster. 
+ + **clickhouse client --host** *Host name/Instance IP address* **--secure** **--port 9440** + +#. .. _mrs_01_24292__li14929104515101: + + Create a table with the format corresponding to the CSV file. + + **CREATE TABLE [IF NOT EXISTS]** *[database_name.]table_name* **[ON CLUSTER** *Cluster name*\ **]** + + **(** + + *name1* **[**\ *type1*\ **] [DEFAULT\|**\ *materialized*\ **\|ALIAS** *expr1*\ **],** + + *name2* **[**\ *type2*\ **] [DEFAULT\|**\ *materialized*\ **\|ALIAS** *expr2*\ **],** + + **...** + + **)** *ENGINE* = *engine* + +#. Import the content in the backup file to the table created in :ref:`7 ` to restore data. + + **clickhouse client --host** *Host name/Instance IP address* **--secure --port 9440 --query=**"**insert into** *Table name* **FORMAT CSV**" **<** *CSV file path* + + The following shows an example of restoring data from the **default_test.csv** backup file to the **test_cpy** table on the ClickHouse instance **10.244.225.167**. + + **clickhouse client --host** **10.244.225.167** **--secure --port 9440 --query="insert into default.test_cpy FORMAT CSV" < /opt/clickhouse/default_test.csv** diff --git a/doc/component-operation-guide/source/using_clickhouse/clickhouse_log_overview.rst b/doc/component-operation-guide/source/using_clickhouse/clickhouse_log_overview.rst new file mode 100644 index 0000000..2807b3e --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/clickhouse_log_overview.rst @@ -0,0 +1,90 @@ +:original_name: mrs_01_2399.html + +.. _mrs_01_2399: + +ClickHouse Log Overview +======================= + +Log Description +--------------- + +**Log path**: The default storage path of ClickHouse log files is as follows: **${BIGDATA_LOG_HOME}/clickhouse** + +**Log archive rule**: The automatic ClickHouse log compression function is enabled. By default, when the size of logs exceeds 100 MB, logs are automatically compressed into a log file named in the following format: **\ **.**\ *[ID]*\ **.gz**. A maximum of 10 latest compressed files are reserved by default. The number of compressed files can be configured on Manager. + +.. table:: **Table 1** ClickHouse log list + + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Log File Name | Description | + +=====================+===========================================================================================================================+======================================================================================================================================+ + | Run logs | /var/log/Bigdata/clickhouse/clickhouseServer/clickhouse-server.err.log | Path of ClickHouseServer error log files. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/clickhouseServer/checkService.log | Path of key ClickHouseServer run log files. 
| + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/clickhouseServer/clickhouse-server.log | | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/balance/start.log | Path of ClickHouseBalancer startup log files. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/balance/error.log | Path of ClickHouseBalancer error log files. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/balance/access_http.log | Path of ClickHouseBalancer run log files. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Data migration logs | /var/log/Bigdata/clickhouse/migration/*Data migration task name*/clickhouse-copier_{timestamp}_{processId}/copier.log | Run logs generated when you use the migration tool by referring to :ref:`Using the ClickHouse Data Migration Tool `. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | | /var/log/Bigdata/clickhouse/migration/*Data migration task name*/clickhouse-copier_{timestamp}_{processId}/copier.err.log | Error logs generated when you use the migration tool by referring to :ref:`Using the ClickHouse Data Migration Tool `. | + +---------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + +Log Level +--------- + +:ref:`Table 2 ` describes the log levels supported by ClickHouse. + +Levels of run logs are error, warning, trace, information, and debug from the highest to the lowest priority. Run logs of equal or higher levels are recorded. The higher the specified log level, the fewer the logs recorded. + +.. _mrs_01_2399__tc09b739e3eb34797a6da936a37654e97: + +.. 
table:: **Table 2** Log levels + + +----------+-------------+------------------------------------------------------------------------------------------+ + | Log Type | Level | Description | + +==========+=============+==========================================================================================+ + | Run log | error | Logs of this level record error information about system running. | + +----------+-------------+------------------------------------------------------------------------------------------+ + | | warning | Logs of this level record exception information about the current event processing. | + +----------+-------------+------------------------------------------------------------------------------------------+ + | | trace | Logs of this level record trace information about the current event processing. | + +----------+-------------+------------------------------------------------------------------------------------------+ + | | information | Logs of this level record normal running status information about the system and events. | + +----------+-------------+------------------------------------------------------------------------------------------+ + | | debug | Logs of this level record system running and debugging information. | + +----------+-------------+------------------------------------------------------------------------------------------+ + +To modify log levels, perform the following operations: + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > **Services** > **ClickHouse** > **Configurations**. +#. Select **All Configurations**. +#. On the menu bar on the left, select the log menu of the target role. +#. Select a desired log level. +#. Click **Save**. Then, click **OK**. + +.. note:: + + The configurations take effect immediately without the need to restart the service. + +Log Format +---------- + +The following table lists the ClickHouse log format: + +.. table:: **Table 3** Log formats + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Format | Example | + +=======================+========================================================================================================================================================+==========================================================================================================================================================================================================================================+ + | Run log | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2021.02.23 15:26:30.691301 [ 6085 ] {} DynamicQueryHandler: Code: 516, e.displayText() = DB::Exception: default: Authentication failed: password is incorrect or there is no user with such name, Stack trace (when copying this | + | | | | + | | | message, always include the lines below): | + | | | | + | | | 0. 
Poco::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, int) @ 0x1250e59c | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_clickhouse/clickhouse_table_engine_overview.rst b/doc/component-operation-guide/source/using_clickhouse/clickhouse_table_engine_overview.rst new file mode 100644 index 0000000..90f2705 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/clickhouse_table_engine_overview.rst @@ -0,0 +1,389 @@ +:original_name: mrs_01_24105.html + +.. _mrs_01_24105: + +ClickHouse Table Engine Overview +================================ + +Background +---------- + +Table engines play a key role in ClickHouse to determine: + +- Where to write and read data +- Supported query modes +- Whether concurrent data access is supported +- Whether indexes can be used +- Whether multi-thread requests can be executed +- Parameters used for data replication + +This section describes MergeTree and Distributed engines, which are the most important and frequently used ClickHouse table engines. + +MergeTree Family +---------------- + +Engines of the MergeTree family are the most universal and functional table engines for high-load tasks. They have the following key features: + +- Data is stored by partition and block based on partitioning keys. +- Data index is sorted based on primary keys and the **ORDER BY** sorting keys. +- Data replication is supported by table engines prefixed with Replicated. +- Data sampling is supported. + +When data is written, a table with this type of engine divides data into different folders based on the partitioning key. Each column of data in the folder is an independent file. A file that records serialized index sorting is created. This structure reduces the volume of data to be retrieved during data reading, greatly improving query efficiency. + +- MergeTree + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2], + ... + INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1, + INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2 + ) ENGINE = MergeTree() + ORDER BY expr + [PARTITION BY expr] + [PRIMARY KEY expr] + [SAMPLE BY expr] + [TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...] + [SETTINGS name=value, ...] + + **Example**: + + .. code-block:: + + CREATE TABLE default.test ( + name1 DateTime, + name2 String, + name3 String, + name4 String, + name5 Date, + ... + ) ENGINE = MergeTree() + PARTITION BY toYYYYMM(name5) + ORDER BY (name1, name2) + SETTINGS index_granularity = 8192 + + Parameters in the example are described as follows: + + - **ENGINE = MergeTree()**: specifies the MergeTree engine. + - **PARTITION BY** **toYYYYMM(name4)**: specifies the partition. The sample data is partitioned by month, and a folder is created for each month. + - **ORDER BY**: specifies the sorting field. A multi-field index can be sorted. 
If the first field is the same, the second field is used for sorting, and so on. + - **index_granularity = 8192**: specifies the index granularity. One index value is recorded for every 8,192 data records. + + If the data to be queried exists in a partition or sorting field, the data query time can be greatly reduced. + +- ReplacingMergeTree + + Different from MergeTree, ReplacingMergeTree deletes duplicate entries with the same sorting key. ReplacingMergeTree is suitable for clearing duplicate data to save space, but it does not guarantee the absence of duplicate data. Generally, it is not recommended. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = ReplacingMergeTree([ver]) + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [SETTINGS name=value, ...] + +- SummingMergeTree + + When merging data parts in SummingMergeTree tables, ClickHouse merges all rows with the same primary key into one row that contains summed values for the columns with the numeric data type. If the primary key is composed in a way that a single key value corresponds to large number of rows, storage volume can be significantly reduced and the data query speed can be accelerated. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = SummingMergeTree([columns]) + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [SETTINGS name=value, ...] + + **Example**: + + Create a SummingMergeTree table named **testTable**. + + .. code-block:: + + CREATE TABLE testTable + ( + id UInt32, + value UInt32 + ) + ENGINE = SummingMergeTree() + ORDER BY id + + Insert data into the table. + + .. code-block:: + + INSERT INTO testTable Values(5,9),(5,3),(4,6),(1,2),(2,5),(1,4),(3,8); + INSERT INTO testTable Values(88,5),(5,5),(3,7),(3,5),(1,6),(2,6),(4,7),(4,6),(43,5),(5,9),(3,6); + + Query all data in unmerged parts. + + .. code-block:: + + SELECT * FROM testTable + ┌─id─┬─value─┐ + │ 1 │ 6 │ + │ 2 │ 5 │ + │ 3 │ 8 │ + │ 4 │ 6 │ + │ 5 │ 12 │ + └───┴──── ┘ + ┌─id─┬─value─┐ + │ 1 │ 6 │ + │ 2 │ 6 │ + │ 3 │ 18 │ + │ 4 │ 13 │ + │ 5 │ 14 │ + │ 43 │ 5 │ + │ 88 │ 5 │ + └───┴──── ┘ + + If ClickHouse has not summed up all rows and you need to aggregate data by ID, use the **sum** function and **GROUP BY** statement. + + .. code-block:: + + SELECT id, sum(value) FROM testTable GROUP BY id + ┌─id─┬─sum(value)─┐ + │ 4 │ 19 │ + │ 3 │ 26 │ + │ 88 │ 5 │ + │ 2 │ 11 │ + │ 5 │ 26 │ + │ 1 │ 12 │ + │ 43 │ 5 │ + └───┴───────┘ + + Merge rows manually. + + .. code-block:: + + OPTIMIZE TABLE testTable + + Query data in the **testTable** table again. + + .. code-block:: + + SELECT * FROM testTable + ┌─id─┬─value─┐ + │ 1 │ 12 │ + │ 2 │ 11 │ + │ 3 │ 26 │ + │ 4 │ 19 │ + │ 5 │ 26 │ + │ 43 │ 5 │ + │ 88 │ 5 │ + └───┴──── ┘ + + SummingMergeTree uses the **ORDER BY** sorting keys as the condition keys to aggregate data. That is, if sorting keys are the same, data records are merged into one and the specified merged fields are aggregated. + + Data is pre-aggregated only when merging is executed in the background, and the merging execution time cannot be predicted. 
Therefore, it is possible that some data has been pre-aggregated and some data has not been aggregated. Therefore, the **GROUP BY** statement must be used during aggregation. + +- AggregatingMergeTree + + AggregatingMergeTree is a pre-aggregation engine used to improve aggregation performance. When merging partitions, the AggregatingMergeTree engine aggregates data based on predefined conditions, calculates data based on predefined aggregate functions, and saves the data in binary format to tables. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = AggregatingMergeTree() + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [TTL expr] + [SETTINGS name=value, ...] + + **Example**: + + You do not need to set the AggregatingMergeTree parameter separately. When partitions are merged, data in each partition is aggregated based on the **ORDER BY** sorting key. You can set the aggregate functions to be used and column fields to be calculated by defining the AggregateFunction type, as shown in the following example: + + .. code-block:: + + create table test_table ( + name1 String, + name2 String, + name3 AggregateFunction(uniq,String), + name4 AggregateFunction(sum,Int), + name5 DateTime + ) ENGINE = AggregatingMergeTree() + PARTITION BY toYYYYMM(name5) + ORDER BY (name1,name2) + PRIMARY KEY name1; + + When data of the AggregateFunction type is written or queried, the **\*state** and **\*merge** functions need to be called. The asterisk (``*``) indicates the aggregate functions used for defining the field type. For example, the **uniq** and **sum** functions are specified for the **name3** and **name4** fields defined in the **test_table**, respectively. Therefore, you need to call the **uniqState** and **sumState** functions and run the **INSERT** and **SELECT** statements when writing data into the table. + + .. code-block:: + + insert into test_table select '8','test1',uniqState('name1'),sumState(toInt32(100)),'2021-04-30 17:18:00'; + insert into test_table select '8','test1',uniqState('name1'),sumState(toInt32(200)),'2021-04-30 17:18:00'; + + When querying data, you need to call the corresponding functions **uniqMerge** and **sumMerge**. + + .. code-block:: + + select name1,name2,uniqMerge(name3),sumMerge(name4) from test_table group by name1,name2; + ┌─name1─┬─name2─┬─uniqMerge(name3)─┬─sumMerge(name4)─┐ + │ 8 │ test1 │ 1 │ 300 │ + └──── ┴──── ┴──────────┴───────── ┘ + + AggregatingMergeTree is more commonly used with materialized views, which are query views of other data tables at the upper layer. + +- CollapsingMergeTree + + CollapsingMergeTree defines a **Sign** field to record status of data rows. If **Sign** is **1**, the data in this row is valid. If **Sign** is **-1**, the data in this row needs to be deleted. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = CollapsingMergeTree(sign) + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [SETTINGS name=value, ...] 
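+
+   **Example**:
+
+   The following is a minimal sketch (the table and column names are illustrative and not taken from the syntax above) showing how rows with the same sorting key and opposite **Sign** values are collapsed:
+
+   .. code-block::
+
+      CREATE TABLE test_collapse
+      (
+          id UInt32,
+          views UInt32,
+          Sign Int8
+      )
+      ENGINE = CollapsingMergeTree(Sign)
+      ORDER BY id;
+
+      -- Write the original row with Sign = 1, then cancel it with Sign = -1 and write the updated row.
+      INSERT INTO test_collapse VALUES (1, 10, 1);
+      INSERT INTO test_collapse VALUES (1, 10, -1), (1, 20, 1);
+
+      -- Rows are collapsed only when parts are merged in the background, so aggregate with Sign
+      -- to obtain the collapsed result at query time. Expected result: id = 1, total_views = 20.
+      SELECT id, sum(views * Sign) AS total_views
+      FROM test_collapse
+      GROUP BY id
+      HAVING sum(Sign) > 0;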
+ +- VersionedCollapsingMergeTree + + The VersionedCollapsingMergeTree engine adds **Version** to the table creation statement to record the mapping between a **state** row and a **cancel** row in case that rows are out of order. The rows with the same primary key, same **Version**, and opposite **Sign** will be deleted during compaction. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = VersionedCollapsingMergeTree(sign, version) + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [SETTINGS name=value, ...] + +- GraphiteMergeTree + + The GraphiteMergeTree engine is used to store data in the time series database Graphite. + + **Syntax for creating a table**: + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + Path String, + Time DateTime, + Value , + Version + ... + ) ENGINE = GraphiteMergeTree(config_section) + [PARTITION BY expr] + [ORDER BY expr] + [SAMPLE BY expr] + [SETTINGS name=value, ...] + +Replicated*MergeTree Engines +---------------------------- + +All engines of the MergeTree family in ClickHouse prefixed with Replicated become MergeTree engines that support replicas. + +|image1| + +Replicated series engines use ZooKeeper to synchronize data. When a replicated table is created, all replicas of the same shard are synchronized based on the information registered with ZooKeeper. + +**Template for creating a Replicated engine**: + +.. code-block:: + + ENGINE = Replicated*MergeTree('Storage path in ZooKeeper','Replica name', ...) + +Two parameters need to be specified for a Replicated engine: + +- *Storage path in ZooKeeper*: specifies the path for storing table data in ZooKeeper. The path format is **/clickhouse/tables/{shard}/Database name/Table name**. +- *Replica name*: Generally, **{replica}** is used. + +For details about the example, see :ref:`Creating a ClickHouse Table `. + +Distributed Engine +------------------ + +The Distributed engine does not store any data. It serves as a transparent proxy for data shards and can automatically transmit data to each node in the cluster. Distributed tables need to work with other local data tables. Distributed tables distribute received read and write tasks to each local table where data is stored. + + +.. figure:: /_static/images/en-us_image_0000001349289865.png + :alt: **Figure 1** Working principle of the Distributed engine + + **Figure 1** Working principle of the Distributed engine + +**Template for creating a Distributed engine**: + +.. code-block:: + + ENGINE = Distributed(cluster_name, database_name, table_name, [sharding_key]) + +Parameters of a distributed table are described as follows: + +- **cluster_name**: specifies the cluster name. When a distributed table is read or written, the cluster configuration information is used to search for the corresponding ClickHouse instance node. +- **database_name**: specifies the database name. +- **table_name**: specifies the name of a local table in the database. It is used to map a distributed table to a local table. +- **sharding_key** (optional): specifies the sharding key, based on which a distributed table distributes data to each local table. + +**Example**: + +.. code-block:: + + -- Create a ReplicatedMergeTree local table named test. 
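+   -- {shard} and {replica} are macros configured on each ClickHouse instance and expand per node,
+   -- so this single ON CLUSTER statement registers every replica under its own path in ZooKeeper.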
+ CREATE TABLE default.test ON CLUSTER default_cluster_1 + ( + `EventDate` DateTime, + `id` UInt64 + ) + ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/test', '{replica}') + PARTITION BY toYYYYMM(EventDate) + ORDER BY id + + -- Create a distributed table named test_all based on the local table test. + CREATE TABLE default.test_all ON CLUSTER default_cluster_1 + ( + `EventDate` DateTime, + `id` UInt64 + ) + ENGINE = Distributed(default_cluster_1, default, test, rand()) + +**Rules for creating a distributed table**: + +- When creating a distributed table, add **ON CLUSTER** *cluster_name* to the table creation statement so that the statement can be executed once on a ClickHouse instance and then distributed to all instances in the cluster for execution. +- Generally, a distributed table is named in the following format: *Local table name*\ \_all. It forms a one-to-many mapping with local tables. Then, multiple local tables can be operated using the distributed table proxy. +- Ensure that the structure of a distributed table is the same as that of local tables. If they are inconsistent, no error is reported during table creation, but an exception may be reported during data query or insertion. + +.. |image1| image:: /_static/images/en-us_image_0000001295770752.png diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/alter_table_modifying_a_table_structure.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/alter_table_modifying_a_table_structure.rst new file mode 100644 index 0000000..16d288c --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/alter_table_modifying_a_table_structure.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_24204.html + +.. _mrs_01_24204: + +ALTER TABLE: Modifying a Table Structure +======================================== + +This section describes the basic syntax and usage of the SQL statement for modifying a table structure in ClickHouse. + +Basic Syntax +------------ + +**ALTER TABLE** [*database_name*].\ *name* [**ON CLUSTER** *cluster*] **ADD**\ \|\ **DROP**\ \|\ **CLEAR**\ \|\ **COMMENT**\ \|\ **MODIFY** **COLUMN** ... + +.. note:: + + **ALTER** supports only ``*``\ MergeTree, Merge, and Distributed engine tables. + +Example +------- + +.. code-block:: + + -- Add the test01 column to the t1 table. + ALTER TABLE t1 ADD COLUMN test01 String DEFAULT 'defaultvalue'; + -- Query the modified table t1. + desc t1 + ┌─name────┬─type─┬─default_type─┬─default_expression ┬─comment─┬─codec_expression─┬─ttl_expression─┐ + │ id │ UInt8 │ │ │ │ │ │ + │ name │ String │ │ │ │ │ │ + │ address │ String │ │ │ │ │ │ + │ test01 │ String │ DEFAULT │ 'defaultvalue' │ │ │ │ + └───────┴────┴────────┴────────── ┴───── ┴──────────┴─────────┘ + -- Change the type of the name column in the t1 table to UInt8. + ALTER TABLE t1 MODIFY COLUMN name UInt8; + -- Query the modified table t1. + desc t1 + ┌─name────┬─type─┬─default_type─┬─default_expression ┬─comment─┬─codec_expression─┬─ttl_expression─┐ + │ id │ UInt8 │ │ │ │ │ │ + │ name │ UInt8 │ │ │ │ │ │ + │ address │ String │ │ │ │ │ │ + │ test01 │ String │ DEFAULT │ 'defaultvalue' │ │ │ │ + └───────┴────┴────────┴────────── ┴───── ┴──────────┴─────────┘ + -- Delete the test01 column from the t1 table. + ALTER TABLE t1 DROP COLUMN test01; + -- Query the modified table t1. 
+ desc t1 + ┌─name────┬─type─┬─default_type─┬─default_expression ┬─comment─┬─codec_expression─┬─ttl_expression─┐ + │ id │ UInt8 │ │ │ │ │ │ + │ name │ UInt8 │ │ │ │ │ │ + │ address │ String │ │ │ │ │ │ + └───────┴────┴────────┴────────── ┴───── ┴──────────┴─────────┘ diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_database_creating_a_database.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_database_creating_a_database.rst new file mode 100644 index 0000000..d9ec83d --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_database_creating_a_database.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_24200.html + +.. _mrs_01_24200: + +CREATE DATABASE: Creating a Database +==================================== + +This section describes the basic syntax and usage of the SQL statement for creating a ClickHouse database. + +Basic Syntax +------------ + +**CREATE DATABASE [IF NOT EXISTS]** *Database_name* **[ON CLUSTER** *ClickHouse cluster name*\ **]** + +.. note:: + + The syntax **ON CLUSTER** *ClickHouse cluster name* enables the Data Definition Language (DDL) statement to be executed on all instances in the cluster at a time. You can run the following statement to obtain the cluster name from the **cluster** field: + + **select cluster,shard_num,replica_num,host_name from system.clusters;** + +Example +------- + +.. code-block:: + + -- Create a database named test. + CREATE DATABASE test ON CLUSTER default_cluster; + -- After the creation is successful, run the query command for verification. + show databases; + ┌─name───┐ + │ default │ + │ system │ + │ test │ + └──────┘ diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_table_creating_a_table.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_table_creating_a_table.rst new file mode 100644 index 0000000..60e2256 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/create_table_creating_a_table.rst @@ -0,0 +1,60 @@ +:original_name: mrs_01_24201.html + +.. _mrs_01_24201: + +CREATE TABLE: Creating a Table +============================== + +This section describes the basic syntax and usage of the SQL statement for creating a ClickHouse table. + +Basic Syntax +------------ + +- Method 1: Creating a table named **table_name** in the specified **database_name** database. + + If the table creation statement does not contain **database_name**, the name of the database selected during client login is used by default. + + **CREATE TABLE [IF NOT EXISTS]** *[database_name.]table_name* **[ON CLUSTER** *ClickHouse cluster name*\ **]** + + **(** + + *name1* **[**\ *type1*\ **] [DEFAULT\|MATERIALIZED\|ALIAS** *expr1*\ **],** + + *name2* **[**\ *type2*\ **] [DEFAULT\|MATERIALIZED\|ALIAS** *expr2*\ **],** + + **...** + + **)** *ENGINE* = *engine\_name()* + + [**PARTITION BY** *expr_list*] + + [**ORDER BY** *expr_list*] + + .. caution:: + + You are advised to use **PARTITION BY** to create table partitions when creating a ClickHouse table. The ClickHouse data migration tool migrates data based on table partitions. If you do not use **PARTITION BY** to create table partitions during table creation, the table data cannot be migrated on the GUI in :ref:`Using the ClickHouse Data Migration Tool `. 
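+
+   For example, the following is a minimal sketch of Method 1 (the database, table, and cluster names are illustrative), creating a MergeTree table partitioned by month:
+
+   .. code-block::
+
+      -- The demo_db database must already exist (see CREATE DATABASE: Creating a Database).
+      CREATE TABLE IF NOT EXISTS demo_db.demo_table ON CLUSTER default_cluster
+      (
+          `create_time` DateTime,
+          `id` UInt64
+      )
+      ENGINE = MergeTree()
+      PARTITION BY toYYYYMM(create_time)
+      ORDER BY id;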
+ +- Method 2: Creating a table with the same structure as **database_name2.table_name2** and specifying a different table engine for the table + + If no table engine is specified, the created table uses the same table engine as **database_name2.table_name2**. + + **CREATE TABLE [IF NOT EXISTS]** *[database_name.]table_name* **AS** [*database_name2*.]\ *table_name*\ 2 [ENGINE = *engine\_name*] + +- Method 3: Using the specified engine to create a table with the same structure as the result of the **SELECT** clause and filling it with the result of the **SELECT** clause + + **CREATE TABLE [IF NOT EXISTS]** *[database_name.]table_name* *ENGINE* = *engine\_name* **AS SELECT** ... + +Example +------- + +.. code-block:: + + -- Create a table named test in the default database and default_cluster cluster. + CREATE TABLE default.test ON CLUSTER default_cluster + ( + `EventDate` DateTime, + `id` UInt64 + ) + ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/test', '{replica}') + PARTITION BY toYYYYMM(EventDate) + ORDER BY id diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/desc_querying_a_table_structure.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/desc_querying_a_table_structure.rst new file mode 100644 index 0000000..be73779 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/desc_querying_a_table_structure.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_24205.html + +.. _mrs_01_24205: + +DESC: Querying a Table Structure +================================ + +This section describes the basic syntax and usage of the SQL statement for querying a table structure in ClickHouse. + +Basic Syntax +------------ + +**DESC**\ \|\ **DESCRIBE** **TABLE** [*database_name*.]\ *table* [**INTO** OUTFILE filename] [FORMAT format] + +Example +------- + +.. code-block:: + + -- Query the t1 table structure. + desc t1; + ┌─name────┬─type─┬─default_type─┬─default_expression ┬─comment─┬─codec_expression─┬─ttl_expression─┐ + │ id │ UInt8 │ │ │ │ │ │ + │ name │ UInt8 │ │ │ │ │ │ + │ address │ String │ │ │ │ │ │ + └───────┴────┴────────┴────────── ┴───── ┴──────────┴─────────┘ diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/drop_deleting_a_table.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/drop_deleting_a_table.rst new file mode 100644 index 0000000..621f99c --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/drop_deleting_a_table.rst @@ -0,0 +1,21 @@ +:original_name: mrs_01_24208.html + +.. _mrs_01_24208: + +DROP: Deleting a Table +====================== + +This section describes the basic syntax and usage of the SQL statement for deleting a ClickHouse table. + +Basic Syntax +------------ + +**DROP** [**TEMPORARY**] **TABLE** [**IF EXISTS**] [*database_name*.]\ *name* [**ON CLUSTER** *cluster*] + +Example +------- + +.. code-block:: + + -- Delete the t1 table. + drop table t1; diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/index.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/index.rst new file mode 100644 index 0000000..96c0e58 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/index.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_24199.html + +.. 
_mrs_01_24199: + +Common ClickHouse SQL Syntax +============================ + +- :ref:`CREATE DATABASE: Creating a Database ` +- :ref:`CREATE TABLE: Creating a Table ` +- :ref:`INSERT INTO: Inserting Data into a Table ` +- :ref:`SELECT: Querying Table Data ` +- :ref:`ALTER TABLE: Modifying a Table Structure ` +- :ref:`DESC: Querying a Table Structure ` +- :ref:`DROP: Deleting a Table ` +- :ref:`SHOW: Displaying Information About Databases and Tables ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + create_database_creating_a_database + create_table_creating_a_table + insert_into_inserting_data_into_a_table + select_querying_table_data + alter_table_modifying_a_table_structure + desc_querying_a_table_structure + drop_deleting_a_table + show_displaying_information_about_databases_and_tables diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/insert_into_inserting_data_into_a_table.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/insert_into_inserting_data_into_a_table.rst new file mode 100644 index 0000000..f478ae1 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/insert_into_inserting_data_into_a_table.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_24202.html + +.. _mrs_01_24202: + +INSERT INTO: Inserting Data into a Table +======================================== + +This section describes the basic syntax and usage of the SQL statement for inserting data to a table in ClickHouse. + +Basic Syntax +------------ + +- Method 1: Inserting data in standard format + + **INSERT INTO** *[database_name.]table* [(*c1, c2, c3*)] **VALUES** (*v11, v12, v13*), (*v21, v22, v23*), ... + +- Method 2: Using the **SELECT** result to insert data + + **INSERT INTO** *[database_name.]table* [(c1, c2, c3)] **SELECT** ... + +Example +------- + +.. code-block:: + + -- Insert data into the test2 table. + insert into test2 (id, name) values (1, 'abc'), (2, 'bbbb'); + -- Query data in the test2 table. + select * from test2; + ┌─id─┬─name─┐ + │ 1 │ abc │ + │ 2 │ bbbb │ + └───┴────┘ diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/select_querying_table_data.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/select_querying_table_data.rst new file mode 100644 index 0000000..61874f3 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/select_querying_table_data.rst @@ -0,0 +1,70 @@ +:original_name: mrs_01_24203.html + +.. _mrs_01_24203: + +SELECT: Querying Table Data +=========================== + +This section describes the basic syntax and usage of the SQL statement for querying table data in ClickHouse. + +Basic Syntax +------------ + +**SELECT** [**DISTINCT**] expr_list + +[**FROM** [*database_name*.]\ *table* \| (subquery) \| table_function] [**FINAL**] + +[SAMPLE sample_coeff] + +[ARRAY **JOIN** ...] + +[**GLOBAL**] [**ANY**\ \|\ **ALL**\ \|\ **ASOF**] [**INNER**\ \|\ **LEFT**\ \|\ **RIGHT**\ \|\ **FULL**\ \|\ **CROSS**] [**OUTER**\ \|SEMI|ANTI] **JOIN** (subquery)\|\ **table** (**ON** )|(**USING** ) + +[PREWHERE expr] + +[**WHERE** expr] + +[**GROUP BY** expr_list] [**WITH** TOTALS] + +[**HAVING** expr] + +[**ORDER BY** expr_list] [**WITH** FILL] [**FROM** expr] [**TO** expr] [STEP expr] + +[**LIMIT** [offset_value, ]n **BY** columns] + +[**LIMIT** [n, ]m] [**WITH** TIES] + +[**UNION ALL** ...] 
+ +[**INTO** OUTFILE filename] + +[FORMAT format] + +Example +------- + +.. code-block:: + + -- View ClickHouse cluster information. + select * from system.clusters; + -- View the macros set for the current node. + select * from system.macros; + -- Check the database capacity. + select + sum(rows) as "Total number of rows", + formatReadableSize(sum(data_uncompressed_bytes)) as "Original size", + formatReadableSize(sum(data_compressed_bytes)) as "Compression size", + round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, + 0) "Compression rate" + from system.parts; + -- Query the capacity of the test table. Add or modify the where clause based on the site requirements. + select + sum(rows) as "Total number of rows", + formatReadableSize(sum(data_uncompressed_bytes)) as "Original size", + formatReadableSize(sum(data_compressed_bytes)) as "Compression size", + round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, + 0) "Compression rate" + from system.parts + where table in ('test') + and partition like '2020-11-%' + group by table; diff --git a/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/show_displaying_information_about_databases_and_tables.rst b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/show_displaying_information_about_databases_and_tables.rst new file mode 100644 index 0000000..33ac41e --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/common_clickhouse_sql_syntax/show_displaying_information_about_databases_and_tables.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_24207.html + +.. _mrs_01_24207: + +SHOW: Displaying Information About Databases and Tables +======================================================= + +This section describes the basic syntax and usage of the SQL statement for displaying information about databases and tables in ClickHouse. + +Basic Syntax +------------ + +**show databases** + +**show tables** + +Example +------- + +.. code-block:: + + -- Query database information. + show databases; + ┌─name────┐ + │ default │ + │ system │ + │ test │ + └───────┘ + -- Query table information. + show tables; + ┌─name──┐ + │ t1 │ + │ test │ + │ test2 │ + │ test5 │ + └─────┘ diff --git a/doc/component-operation-guide/source/using_clickhouse/creating_a_clickhouse_table.rst b/doc/component-operation-guide/source/using_clickhouse/creating_a_clickhouse_table.rst new file mode 100644 index 0000000..eef634d --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/creating_a_clickhouse_table.rst @@ -0,0 +1,322 @@ +:original_name: mrs_01_2398.html + +.. _mrs_01_2398: + +Creating a ClickHouse Table +=========================== + +ClickHouse implements the replicated table mechanism based on the ReplicatedMergeTree engine and ZooKeeper. When creating a table, you can specify an engine to determine whether the table is highly available. Shards and replicas of each table are independent of each other. + +ClickHouse also implements the distributed table mechanism based on the Distributed engine. Views are created on all shards (local tables) for distributed query, which is easy to use. ClickHouse has the concept of data sharding, which is one of the features of distributed storage. That is, parallel read and write are used to improve efficiency. + +The ClickHouse cluster table engine that uses Kunpeng as the CPU architecture does not support HDFS and Kafka. + +.. 
_mrs_01_2398__section1386435625: + +Viewing cluster and Other Environment Parameters of ClickHouse +-------------------------------------------------------------- + +#. Use the ClickHouse client to connect to the ClickHouse server by referring to :ref:`Using ClickHouse from Scratch `. + +#. .. _mrs_01_2398__li5153155032517: + + Query the cluster identifier and other information about the environment parameters. + + **select cluster,shard_num,replica_num,host_name from system.clusters;** + + .. code-block:: + + SELECT + cluster, + shard_num, + replica_num, + host_name + FROM system.clusters + + ┌─cluster───────────┬─shard_num─┬─replica_num─┬─host_name──────── ┐ + │ default_cluster_1 │ 1 │ 1 │ node-master1dOnG │ + │ default_cluster_1 │ 1 │ 2 │ node-group-1tXED0001 │ + │ default_cluster_1 │ 2 │ 1 │ node-master2OXQS │ + │ default_cluster_1 │ 2 │ 2 │ node-group-1tXED0002 │ + │ default_cluster_1 │ 3 │ 1 │ node-master3QsRI │ + │ default_cluster_1 │ 3 │ 2 │ node-group-1tXED0003 │ + └─────────────── ┴────── ┴─────── ┴──────────────┘ + + 6 rows in set. Elapsed: 0.001 sec. + +#. Query the shard and replica identifiers. + + **select \* from system.macros**; + + .. code-block:: + + SELECT * + FROM system.macros + + ┌─macro───┬─substitution─────┐ + │ id │ 76 │ + │ replica │ node-master3QsRI │ + │ shard │ 3 │ + └────── ┴────────────┘ + + 3 rows in set. Elapsed: 0.001 sec. + +.. _mrs_01_2398__section1564103819477: + +Creating a Local Replicated Table and a distributed Table +--------------------------------------------------------- + +#. Log in to the ClickHouse node using the client, for example, **clickhouse client --host** *node-master3QsRI* **--multiline --port 9440 --secure;** + + .. note:: + + *node-master3QsRI* is the value of **host_name** obtained in :ref:`2 ` in :ref:`Viewing cluster and Other Environment Parameters of ClickHouse `. + +#. .. _mrs_01_2398__li89698281356: + + Create a replicated table using the ReplicatedMergeTree engine. + + For details about the syntax, see https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/#creating-replicated-tables. + + For example, run the following commands to create a ReplicatedMergeTree table named **test** on the **default_cluster_1** node and in the **default** database: + + **CREATE TABLE** *default.test* **ON CLUSTER** *default_cluster_1* + + **(** + + **\`EventDate\` DateTime,** + + **\`id\` UInt64** + + **)** + + **ENGINE = ReplicatedMergeTree('**\ */clickhouse/tables/{shard}/default/test*\ **', '**\ *{replica}*'**)** + + **PARTITION BY toYYYYMM(EventDate)** + + **ORDER BY id;** + + The parameters are described as follows: + + - The **ON CLUSTER** syntax indicates the distributed DDL, that is, the same local table can be created on all instances in the cluster after the statement is executed once. + - **default_cluster_1** is the cluster identifier obtained in :ref:`2 ` in :ref:`Viewing cluster and Other Environment Parameters of ClickHouse `. + + .. caution:: + + **ReplicatedMergeTree** engine receives the following two parameters: + + - Storage path of the table data in ZooKeeper + + The path must be in the **/clickhouse** directory. Otherwise, data insertion may fail due to insufficient ZooKeeper quota. + + To avoid data conflict between different tables in ZooKeeper, the directory must be in the following format: + + */clickhouse/tables/{shard}*\ **/**\ *default/test*, in which **/clickhouse/tables/{shard}** is fixed, *default* indicates the database name, and *text* indicates the name of the created table. 
+ + - Replica name: Generally, **{replica}** is used. + + .. code-block:: + + CREATE TABLE default.test ON CLUSTER default_cluster_1 + ( + `EventDate` DateTime, + `id` UInt64 + ) + ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/test', '{replica}') + PARTITION BY toYYYYMM(EventDate) + ORDER BY id + + ┌─host─────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐ + │ node-group-1tXED0002 │ 9000 │ 0 │ │ 5 │ 3 │ + │ node-group-1tXED0003 │ 9000 │ 0 │ │ 4 │ 3 │ + │ node-master1dOnG │ 9000 │ 0 │ │ 3 │ 3 │ + └────────────────────┴────┴─────┴──── ┴─────────── ┴──────────┘ + ┌─host─────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐ + │ node-master3QsRI │ 9000 │ 0 │ │ 2 │ 0 │ + │ node-group-1tXED0001 │ 9000 │ 0 │ │ 1 │ 0 │ + │ node-master2OXQS │ 9000 │ 0 │ │ 0 │ 0 │ + └────────────────────┴────┴─────┴──── ┴─────────── ┴──────────┘ + + 6 rows in set. Elapsed: 0.189 sec. + +#. .. _mrs_01_2398__li16616143173215: + + Create a distributed table using the Distributed engine. + + For example, run the following commands to create a distributed table named **test_all** on the **default_cluster_1** node and in the **default** database: + + **CREATE TABLE** *default.test_all* **ON CLUSTER** *default_cluster_1* + + **(** + + **\`EventDate\` DateTime,** + + **\`id\` UInt64** + + **)** + + **ENGINE = Distributed(**\ *default_cluster_1, default, test, rand()*\ **);** + + .. code-block:: + + CREATE TABLE default.test_all ON CLUSTER default_cluster_1 + ( + `EventDate` DateTime, + `id` UInt64 + ) + ENGINE = Distributed(default_cluster_1, default, test, rand()) + + ┌─host─────────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐ + │ node-group-1tXED0002 │ 9000 │ 0 │ │ 5 │ 0 │ + │ node-master3QsRI │ 9000 │ 0 │ │ 4 │ 0 │ + │ node-group-1tXED0003 │ 9000 │ 0 │ │ 3 │ 0 │ + │ node-group-1tXED0001 │ 9000 │ 0 │ │ 2 │ 0 │ + │ node-master1dOnG │ 9000 │ 0 │ │ 1 │ 0 │ + │ node-master2OXQS │ 9000 │ 0 │ │ 0 │ 0 │ + └────────────────────┴────┴─────┴──── ┴─────────── ┴──────────┘ + + 6 rows in set. Elapsed: 0.115 sec. + + .. note:: + + **Distributed** requires the following parameters: + + - **default_cluster_1** is the cluster identifier obtained in :ref:`2 ` in :ref:`Viewing cluster and Other Environment Parameters of ClickHouse `. + + - **default** indicates the name of the database where the local table is located. + + - **test** indicates the name of the local table. In this example, it is the name of the table created in :ref:`2 `. + + - (Optional) Sharding key + + This key and the weight configured in the **config.xml** file determine the route for writing data to the distributed table, that is, the physical table to which the data is written. It can be the original data (for example, **site_id**) of a column in the table or the result of the function call, for example, **rand()** is used in the preceding SQL statement. Note that data must be evenly distributed in this key. Another common operation is to use the hash value of a column with a large difference, for example, **intHash64(user_id)**. + +ClickHouse Table Data Operations +-------------------------------- + +#. Log in to the ClickHouse node on the client. Example: + + **clickhouse client --host** *node-master3QsRI* **--multiline --port 9440 --secure;** + + .. note:: + + *node-master3QsRI* is the value of **host_name** obtained in :ref:`2 ` in :ref:`Viewing cluster and Other Environment Parameters of ClickHouse `. + +#. .. 
_mrs_01_2398__li77990531075: + + After creating a table by referring to :ref:`Creating a Local Replicated Table and a distributed Table `, you can insert data to the local table. + + For example, run the following command to insert data to the local table **test**: + + **insert into test values(toDateTime(now()), rand());** + +#. Query the local table information. + + For example, run the following command to query data information of the table **test** in :ref:`2 `: + + **select \* from test;** + + .. code-block:: + + SELECT * + FROM test + + ┌───────────EventDate─┬─────────id─┐ + │ 2020-11-05 21:10:42 │ 1596238076 │ + └──────────────── ┴───────────┘ + + 1 rows in set. Elapsed: 0.002 sec. + + +#. Query the distributed table. + + For example, the distributed table **test_all** is created based on table **test** in :ref:`3 `. Therefore, the same data in table **test** can also be queried in table **test_all**. + + **select \* from test_all;** + + .. code-block:: + + SELECT * + FROM test_all + + ┌───────────EventDate─┬─────────id─┐ + │ 2020-11-05 21:10:42 │ 1596238076 │ + └──────────────── ┴───────────┘ + + 1 rows in set. Elapsed: 0.004 sec. + +#. Switch to the shard node with the same **shard_num** and query the information about the current table. The same table data can be queried. + + For example, run the **exit;** command to exit the original node. + + Run the following command to switch to the **node-group-1tXED0003** node: + + **clickhouse client --host** *node-group-1tXED0003* **--multiline --port 9440 --secure;** + + .. note:: + + The **shard_num** values of **node-group-1tXED0003** and **node-master3QsRI** are the same by performing :ref:`2 `. + + **show tables;** + + .. code-block:: + + SHOW TABLES + + ┌─name─────┐ + │ test │ + │ test_all │ + └────────┘ + + +#. Query the local table data. For example, run the following command to query data in table **test** on the **node-group-1tXED0003** node: + + **select \* from test;** + + .. code-block:: + + SELECT * + FROM test + + ┌───────────EventDate─┬─────────id─┐ + │ 2020-11-05 21:10:42 │ 1596238076 │ + └──────────────── ┴───────────┘ + + 1 rows in set. Elapsed: 0.005 sec. + +#. Switch to the shard node with different **shard_num** value and query the data of the created table. + + For example, run the following command to exit the **node-group-1tXED0003** node: + + **exit;** + + Switch to the **node-group-1tXED0001** node. The **shard_num** values of **node-group-1tXED0001** and **node-master3QsRI** are different by performing :ref:`2 `. + + **clickhouse client --host** *node-group-1tXED0001* **--multiline --port 9440 --secure;** + + Query the local table **test**. Data cannot be queried on the different shard node because table **test** is a local table. + + **select \* from test;** + + .. code-block:: + + SELECT * + FROM test + + Ok. + + Query data in the distributed table **test_all**. The data can be queried properly. + + **select \* from test_all;** + + .. code-block:: + + SELECT * + FROM test + + ┌───────────EventDate─┬─────────id─┐ + │ 2020-11-05 21:12:19 │ 3686805070 │ + └──────────────── ┴───────────┘ + + 1 rows in set. Elapsed: 0.002 sec. + diff --git a/doc/component-operation-guide/source/using_clickhouse/index.rst b/doc/component-operation-guide/source/using_clickhouse/index.rst new file mode 100644 index 0000000..0ffb66c --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/index.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_2344.html + +.. 
_mrs_01_2344: + +Using ClickHouse +================ + +- :ref:`Using ClickHouse from Scratch ` +- :ref:`ClickHouse Table Engine Overview ` +- :ref:`Creating a ClickHouse Table ` +- :ref:`Common ClickHouse SQL Syntax ` +- :ref:`Migrating ClickHouse Data ` +- :ref:`User Management and Authentication ` +- :ref:`Backing Up and Restoring ClickHouse Data Using a Data File ` +- :ref:`ClickHouse Log Overview ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_clickhouse_from_scratch + clickhouse_table_engine_overview + creating_a_clickhouse_table + common_clickhouse_sql_syntax/index + migrating_clickhouse_data/index + user_management_and_authentication/index + backing_up_and_restoring_clickhouse_data_using_a_data_file + clickhouse_log_overview diff --git a/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/index.rst b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/index.rst new file mode 100644 index 0000000..e7bc4ae --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_24250.html + +.. _mrs_01_24250: + +Migrating ClickHouse Data +========================= + +- :ref:`Using ClickHouse to Import and Export Data ` +- :ref:`Synchronizing Kafka Data to ClickHouse ` +- :ref:`Using the ClickHouse Data Migration Tool ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_clickhouse_to_import_and_export_data + synchronizing_kafka_data_to_clickhouse + using_the_clickhouse_data_migration_tool diff --git a/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/synchronizing_kafka_data_to_clickhouse.rst b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/synchronizing_kafka_data_to_clickhouse.rst new file mode 100644 index 0000000..b89e640 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/synchronizing_kafka_data_to_clickhouse.rst @@ -0,0 +1,193 @@ +:original_name: mrs_01_24377.html + +.. _mrs_01_24377: + +Synchronizing Kafka Data to ClickHouse +====================================== + +This section describes how to create a Kafka table to automatically synchronize Kafka data to the ClickHouse cluster. + +Prerequisites +------------- + +- You have created a Kafka cluster. The Kafka client has been installed. +- A ClickHouse cluster has been created. It is in the same VPC as the Kafka cluster and can communicate with each other. +- The ClickHouse client has been installed. + +Constraints +----------- + +Currently, ClickHouse cannot interconnect with Kafka clusters with security mode enabled. + +.. _mrs_01_24377__section10908164973416: + +Syntax of the Kafka Table +------------------------- + +- **Syntax** + + .. code-block:: + + CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] + ( + name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1], + name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2], + ... + ) ENGINE = Kafka() + SETTINGS + kafka_broker_list = 'host1:port1,host2:port2', + kafka_topic_list = 'topic1,topic2,...', + kafka_group_name = 'group_name', + kafka_format = 'data_format'; + [kafka_row_delimiter = 'delimiter_symbol',] + [kafka_schema = '',] + [kafka_num_consumers = N] + +- **Parameter description** + + .. 
table:: **Table 1** Kafka table parameters + + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Mandatory | Description | + +=======================+=======================+===========================================================================================================================================================================================================================================================================================+ + | kafka_broker_list | Yes | A list of Kafka broker instances, separated by comma (,). For example, *IP address 1 of the Kafka broker instance*:**9092**,\ *IP address 2 of the Kafka broker instance*:**9092**,\ *IP address 3 of the Kafka broker instance*:**9092**. | + | | | | + | | | To obtain the IP address of the Kafka broker instance, perform the following steps: | + | | | | + | | | - For versions earlier than MRS 3.x, click the cluster name to go to the cluster details page and choose **Components** > **Kafka**. Click **Instances** to query the IP addresses of the Kafka instances. | + | | | | + | | | .. note:: | + | | | | + | | | If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) | + | | | | + | | | - For MRS 3.\ *x* or later, log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Kafka**. Click **Instances** to query the IP addresses of the Kafka instances. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_topic_list | Yes | A list of Kafka topics. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_group_name | Yes | A group of Kafka consumers, which can be customized. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_format | Yes | Kafka message format, for example, JSONEachRow, CSV, and XML. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_row_delimiter | No | Delimiter character, which ends a message. 
| + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_schema | No | Parameter that must be used if the format requires a schema definition. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka_num_consumers | No | Number of consumers in per table. The default value is **1**. If the throughput of a consumer is insufficient, more consumers are required. The total number of consumers cannot exceed the number of partitions in a topic because only one consumer can be allocated to each partition. | + +-----------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +How to Synchronize Kafka Data to ClickHouse +------------------------------------------- + +#. .. _mrs_01_24377__li58847364569: + + Switch to the Kafka client installation directory. For details, see :ref:`Using the Kafka Client `. + + a. Log in to the node where the Kafka client is installed as the Kafka client installation user. + + b. Run the following command to go to the client installation directory: + + **cd /opt/client** + + c. Run the following command to configure environment variables: + + **source bigdata_env** + + d. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. If Kerberos authentication is disabled for the current cluster, skip this step. + + #. Run the following command first for an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + #. **kinit** *Component service user* + +#. .. _mrs_01_24377__li133267241488: + + Run the following command to create a Kafka topic. For details, see :ref:`Managing Kafka Topics `. + + **kafka-topics.sh --topic** *kafkacktest2* **--create --zookeeper** *IP address of the Zookeeper role instance:2181*\ **/kafka --partitions** *2* **--replication-factor** *1* + + .. note:: + + - **--topic** is the name of the topic to be created, for example, **kafkacktest2**. + - **--zookeeper** is the IP address of the node where the ZooKeeper role instances are located, which can be the IP address of any of the three role instances. You can obtain the IP address of the node by performing the following steps: + + - For versions earlier than MRS 3.x, click the cluster name to go to the cluster details page and choose **Components** > **ZooKeeper** > **Instances**. View the IP addresses of the ZooKeeper role instances. + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > *Name of the desired cluster* > **Services** > **ZooKeeper** > **Instance**. View the IP addresses of the ZooKeeper role instances. 
+ + - **--partitions** and **--replication-factor** are the topic partitions and topic backup replicas, respectively. The number of the two parameters cannot exceed the number of Kafka role instances. + +#. .. _mrs_01_24377__li64680261586: + + Log in to the ClickHouse client by referring to :ref:`Using ClickHouse from Scratch `. + + a. Run the following command to go to the client installation directory: + + **cd /opt/Bigdata/client** + + b. Run the following command to configure environment variables: + + **source bigdata_env** + + c. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The user must have the permission to create ClickHouse tables. Therefore, you need to bind the corresponding role to the user. For details, see :ref:`ClickHouse User and Permission Management `. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Component service user* + + Example: **kinit clickhouseuser** + + d. Run the following command to connect to the ClickHouse instance node to which data is to be imported: + + **clickhouse client --host** *IP address of the ClickHouse instance* **--user** *Login username* **--password** **--port** *ClickHouse port number* **--database** *Database name* **--multiline** + + *Enter the user password.* + +#. Create a Kafka table in ClickHouse by referring to :ref:`Syntax of the Kafka Table `. For example, the following table creation statement is used to create a Kafka table whose name is **kafka_src_tbl3**, topic name is **kafkacktest2**, and message format is **JSONEachRow** in the default database. + + .. code-block:: + + create table kafka_src_tbl3 on cluster default_cluster + (id UInt32, age UInt32, msg String) + ENGINE=Kafka() + SETTINGS + kafka_broker_list='IP address 1 of the Kafka broker instance:9092,IP address 2 of the Kafka broker instance:9092,IP address 3 of the Kafka broker instance:9092', + kafka_topic_list='kafkacktest2', + kafka_group_name='cg12', + kafka_format='JSONEachRow'; + +#. Create a ClickHouse replicated table, for example, the ReplicatedMergeTree table named **kafka_dest_tbl3**. + + .. code-block:: + + create table kafka_dest_tbl3 on cluster default_cluster + ( id UInt32, age UInt32, msg String ) + engine = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/kafka_dest_tbl3', '{replica}') + partition by age + order by id; + +#. Create a materialized view, which converts data in Kafka in the background and saves the data to the created ClickHouse table. + + .. code-block:: + + create materialized view consumer3 on cluster default_cluster to kafka_dest_tbl3 as select * from kafka_src_tbl3; + +#. Perform :ref:`1 ` again to go to the Kafka client installation directory. + +#. Run the following command to send a message to the topic created in :ref:`2 `: + + **kafka-console-producer.sh --broker-list** *IP address 1 of the kafka broker instance*\ **:9092,**\ *IP address 2 of the kafka broker instance*\ **:9092,**\ *IP address 3 of the kafka broker instance*\ **:9092** **--topic** *kafkacktest2* + + .. code-block:: + + >{"id":31, "age":30, "msg":"31 years old"} + >{"id":32, "age":30, "msg":"31 years old"} + >{"id":33, "age":30, "msg":"31 years old"} + >{"id":35, "age":30, "msg":"31 years old"} + +#. Use the ClickHouse client to log in to the ClickHouse instance node in :ref:`3 ` and query the ClickHouse table data, for example, to query the replicated table **kafka_dest_tbl3**. 
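+
+   Before running the full query shown below, you can optionally run a quick check that the materialized view is consuming messages. The following is a minimal sketch that uses only the table and test records created in the previous steps:
+
+   .. code-block::
+
+      -- Count the rows that have been synchronized from Kafka.
+      select count() from kafka_dest_tbl3;
+
+      -- Look up one of the test records by id.
+      select id, age, msg from kafka_dest_tbl3 where id = 31;
+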
It shows that the data in the Kafka message has been synchronized to this table. + + .. code-block:: + + select * from kafka_dest_tbl3; + + |image1| + +.. |image1| image:: /_static/images/en-us_image_0000001437950709.png diff --git a/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_clickhouse_to_import_and_export_data.rst b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_clickhouse_to_import_and_export_data.rst new file mode 100644 index 0000000..e6a2f89 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_clickhouse_to_import_and_export_data.rst @@ -0,0 +1,107 @@ +:original_name: mrs_01_24206.html + +.. _mrs_01_24206: + +Using ClickHouse to Import and Export Data +========================================== + + +Using ClickHouse to Import and Export Data +------------------------------------------ + +This section describes the basic syntax and usage of the SQL statements for importing and exporting file data using ClickHouse. + +- Importing data in CSV format + + **clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **--secure --format_csv_delimiter="**\ *CSV file delimiter*\ **" --query="INSERT INTO** *Table name* **FORMAT CSV" <** *Host path where the CSV file is stored* + + Example + + .. code-block:: + + clickhouse client --host 10.5.208.5 --database testdb --port 9440 --secure --format_csv_delimiter="," --query="INSERT INTO testdb.csv_table FORMAT CSV" < /opt/data + + You need to create a table in advance. + +- Exporting data in CSV format + + .. caution:: + + Exporting data files in CSV format may cause CSV injection. Exercise caution when performing this operation. + + **clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m --secure --query=**"SELECT \* **FROM** *Table name*" > *CSV file export path* + + Example + + .. code-block:: + + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="SELECT * FROM test_table" > /opt/test + +- Importing data in Parquet format + + **cat** *Parquet file* **\| clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m --secure --query="INSERT INTO** *Table name* **FORMAT Parquet"** + + Example + + .. code-block:: + + cat /opt/student.parquet | clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="INSERT INTO parquet_tab001 FORMAT Parquet" + +- Exporting data in Parquet format + + **clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m --secure --query=**"**select** \* **from** *Table name* **FORMAT Parquet**" > *Parquet file export path* + + Example + + .. code-block:: + + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="select * from test_table FORMAT Parquet" > /opt/student.parquet + +- Importing data in ORC format + + **cat** *ORC file path* **\| clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m --secure --query=**"**INSERT INTO** *Table name* **FORMAT ORC**" + + Example + + .. 
code-block:: + + cat /opt/student.orc | clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="INSERT INTO orc_tab001 FORMAT ORC" + # Data in the ORC file can be exported from HDFS. For example: + hdfs dfs -cat /user/hive/warehouse/hivedb.db/emp_orc/000000_0_copy_1 | clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="INSERT INTO orc_tab001 FORMAT ORC" + +- Exporting data in ORC format + + **clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m** **--secure --query=**"**select** \* **from** *Table name* **FORMAT ORC**" > *ORC file export path* + + Example + + .. code-block:: + + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="select * from csv_tab001 FORMAT ORC" > /opt/student.orc + +- Importing data in JSON format + + **INSERT INTO** *Table name* **FORMAT JSONEachRow** *JSON string* *1* *JSON string 2* + + Example + + .. code-block:: + + INSERT INTO test_table001 FORMAT JSONEachRow {"PageViews":5, "UserID":"4324182021466249494", "Duration":146,"Sign":-1} {"UserID":"4324182021466249494","PageViews":6,"Duration":185,"Sign":1} + +- Exporting data in JSON format + + **clickhouse client --host** *Host name or IP address of the ClickHouse instance* **--database** *Database name* **--port** *Port number* **-m --secure --query=**"**SELECT** \* **FROM** *Table name* **FORMAT JSON|JSONEachRow|JSONCompact|...**" > *JSON file export path* + + Example + + .. code-block:: + + # Export JSON file. + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="SELECT * FROM test_table FORMAT JSON" > /opt/test.json + + # Export json(JSONEachRow). + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="SELECT * FROM test_table FORMAT JSONEachRow" > /opt/test_jsoneachrow.json + + # Export json(JSONCompact). + clickhouse client --host 10.5.208.5 --database testdb --port 9440 -m --secure --query="SELECT * FROM test_table FORMAT JSONCompact" > /opt/test_jsoncompact.json diff --git a/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_the_clickhouse_data_migration_tool.rst b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_the_clickhouse_data_migration_tool.rst new file mode 100644 index 0000000..4d0af63 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/migrating_clickhouse_data/using_the_clickhouse_data_migration_tool.rst @@ -0,0 +1,65 @@ +:original_name: mrs_01_24198.html + +.. _mrs_01_24198: + +Using the ClickHouse Data Migration Tool +======================================== + +The ClickHouse data migration tool can migrate some partitions of one or more partitioned MergeTree tables on several ClickHouseServer nodes to the same tables on other ClickHouseServer nodes. In the capacity expansion scenario, you can use this tool to migrate data from an original node to a new node to balance data after capacity expansion. + +Prerequisites +------------- + +- The ClickHouse and Zookeeper services are running properly. The ClickHouseServer instances on the source and destination nodes are normal. +- The destination node has the data table to be migrated and the table is a partitioned MergeTree table. +- Before creating a migration task, ensure that all tasks for writing data to a table to be migrated have been stopped. 
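+
+  If you are not sure whether a table meets the partitioned MergeTree requirement above, the following minimal sketch can be run on the ClickHouse client to check the table engine and partition key (**testdb** and **test_table** are example names only):
+
+  .. code-block::
+
+     -- Show the engine and partition key of the table to be migrated.
+     -- testdb and test_table are placeholders for your own database and table.
+     SELECT database, name, engine, partition_key
+     FROM system.tables
+     WHERE database = 'testdb' AND name = 'test_table';
+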
After the task is started, you can only query the table to be migrated and cannot write data to or delete data from the table. Otherwise, data may be inconsistent before and after the migration. +- The ClickHouse data directory on the destination node has sufficient space. + +Procedure +--------- + +#. Log in to Manager and choose **Cluster** > **Services** > **ClickHouse**. On the ClickHouse service page, click the **Data Migration** tab. +#. Click **Add Task**. + +3. On the page for creating a migration task, set the migration task parameters. For details, see :ref:`Table 1 `. + + .. _mrs_01_24198__table1724256152117: + + .. table:: **Table 1** Migration task parameters + + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+=====================================================================================================================================================================================+ + | Task Name | Enter a specific task name. The value can contain 1 to 50 characters, including letters, arrays, and underscores (_), and cannot be the same as that of an existing migration task. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task Type | - **Scheduled Task**: When the scheduled task is selected, you can set **Started** to specify a time point later than the current time to execute the task. | + | | - **Immediate task**: The task is executed immediately after it is started. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Started | Set this parameter when **Task Type** is set to **Scheduled Task**. The valid value is a time point within 90 days from now. | + +-----------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +4. On the **Select Node** page, specify **Source Node Host Name** and **Destination Node Host Name**, and click **Next**. + + .. note:: + + - Only one host name can be entered in **Source Node Host Name** and **Destination Node Host Name**, respectively. Multi-node migration is not supported. + + To obtain the parameter values, click the **Instance** tab on the ClickHouse service page and view the **Host Name** column of the current ClickHouseServer instance. + + - **Maximum Bandwidth** is optional. If it is not specified, there is no upper limit. The maximum bandwidth can be set to **10000** MB/s. + +5. On the **Select Data Table** page, click |image1| next to **Database**, select the database to be migrated on the source node, and select the data table to be migrated for **Data Table**. The data table drop-down list displays the partitioned MergeTree tables in the selected database. In the **Node Information** area, the space usage of the ClickHouse service data directory on the current source and destination nodes is displayed. Click **Next**. + +6. Confirm the task information and click **Submit**. 
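+
+   If you want to estimate in advance how much data the task will move, a minimal sketch such as the following sums the active data parts of the table per partition on the ClickHouse client (**testdb** and **test_table** are example names only):
+
+   .. code-block::
+
+      -- Approximate on-disk size and row count of each partition of the table to be migrated.
+      SELECT partition, sum(bytes_on_disk) AS total_bytes, sum(rows) AS total_rows
+      FROM system.parts
+      WHERE database = 'testdb' AND table = 'test_table' AND active
+      GROUP BY partition
+      ORDER BY partition;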
+ + The data migration tool automatically calculates the partitions to be migrated based on the size of the data table. The amount of data to be migrated is the total size of the partitions to be migrated. + +7. After the migration task is submitted, click **Start** in the **Operation** column. If the task is an immediate task, the task starts to be executed. If the task is a scheduled task, the countdown starts. + +8. During the migration task execution, you can click **Cancel** to cancel the migration task that is being executed. If you cancel the task, the migrated data on the destination node will be rolled back. + + You can choose **More** > **Details** to view the log information during the migration. + +9. After the migration is complete, choose **More** > **Results** to view the migration result and choose **More** > **Delete** to delete the directories related to the migration task on ZooKeeper and the source node. + +.. |image1| image:: /_static/images/en-us_image_0000001349170269.png diff --git a/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/clickhouse_user_and_permission_management.rst b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/clickhouse_user_and_permission_management.rst new file mode 100644 index 0000000..1e4d3ee --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/clickhouse_user_and_permission_management.rst @@ -0,0 +1,202 @@ +:original_name: mrs_01_24057.html + +.. _mrs_01_24057: + +ClickHouse User and Permission Management +========================================= + +User Permission Model +--------------------- + +ClickHouse user permission management enables unified management of users, roles, and permissions on each ClickHouse instance in the cluster. You can use the permission management module of the Manager UI to create users, create roles, and bind the ClickHouse access permissions. User permissions are controlled by binding roles to users. + +Resource management: :ref:`Table 1 ` lists the resources supported by ClickHouse permission management. + +Resource permissions: :ref:`Table 2 ` lists the resource permissions supported by ClickHouse. + +.. _mrs_01_24057__table858112220269: + +.. table:: **Table 1** Permission management objects supported by ClickHouse + + ======== ============= ============== + Resource Integration Remarks + ======== ============= ============== + Database Yes (level 1) ``-`` + Table Yes (level 2) ``-`` + View Yes (level 2) Same as tables + ======== ============= ============== + +.. _mrs_01_24057__table20282143414276: + +.. table:: **Table 2** Resource permission list + + ========== ==================== ===================================== + Resource Available Permission Remarks + ========== ==================== ===================================== + Database CREATE CREATE DATABASE/TABLE/VIEW/DICTIONARY + Table/View SELECT/INSERT ``-`` + ========== ==================== ===================================== + +Prerequisites +------------- + +- The ClickHouse and Zookeeper services are running properly. +- When creating a database or table in the cluster, the **ON CLUSTER** statement is used to ensure that the metadata of the database and table on each ClickHouse node is the same. + +.. note:: + + After the permission is granted, it takes about 1 minute for the permission to take effect. + +.. _mrs_01_24057__section1688472043712: + +Adding the ClickHouse Role +-------------------------- + +#. 
Log in to Manager and choose **System** > **Permission** > **Role**. On the **Role** page, click **Create Role**. + +#. On the **Create Role** page, specify **Role Name**. In the **Configure Resource Permission** area, click the cluster name. On the service list page that is displayed, click the ClickHouse service. + + Determine whether to create a role with ClickHouse administrator permission based on service requirements. + + .. note:: + + - The ClickHouse administrator has all the database operation permissions except the permissions to create, delete, and modify users and roles. + - Only the built-in user **clickhouse** of ClickHouse has the permission to manage users and roles. + + - If yes, go to :ref:`3 `. + - If no, go to :ref:`4 `. + +#. .. _mrs_01_24057__li9365913184120: + + Select **SUPER_USER_GROUP** and click **OK**. + +#. .. _mrs_01_24057__li13347154819413: + + Click **ClickHouse Scope**. The ClickHouse database resource list is displayed. If you select **create**, the role has the create permission on the database. + + Determine whether to grant the permission based on the service requirements. + + - If yes, click **OK**. + - If no, go to :ref:`5 `. + +#. .. _mrs_01_24057__li17964516204412: + + Click the resource name and select the *Database resource name to be operated*. On the displayed page, select **READ** (SELECT permission) or **WRITE** (INSERT permission) based on service requirements, and click **OK**. + +Adding a User and Binding the ClickHouse Role to the User +--------------------------------------------------------- + +#. .. _mrs_01_24057__li1183214191540: + + Log in to Manager and choose **System** > **Permission** > **User** and click **Create**. + +#. Select **Human-Machine** for **User Type** and set **Password** and **Confirm Password** to the password of the user. + + .. note:: + + - Username: The username cannot contain hyphens (-). Otherwise, the authentication will fail. + - Password: The password cannot contain special characters $, ., and #. Otherwise, the authentication will fail. + +#. In the **Role** area, click **Add**. In the displayed dialog box, select a role with the ClickHouse permission and click **OK** to add the role. Then, click **OK**. + +#. Log in to the node where the ClickHouse client is installed and use the new username and password to connect to the ClickHouse service. + + - Run the following command to go to the client installation directory: + + **cd /opt/**\ *Client installation directory* + + - Run the following command to configure environment variables: + + **source bigdata_env** + + - If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The user must have the permission to create ClickHouse tables. Therefore, you need to bind the corresponding role to the user. For details, see :ref:`Adding the ClickHouse Role `. If Kerberos authentication is disabled for the current cluster, skip this step. + + a. Run the following command if it is an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + b. **kinit** *User added in :ref:`1 `* + + - Log in to the system as the new user. 
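+
+     Use whichever of the two commands below matches the cluster's authentication mode. After the connection is established, you can optionally confirm that the role bound on Manager has taken effect; the following is a minimal sketch using standard ClickHouse statements:
+
+     .. code-block::
+
+        -- Show the privileges granted to the current user through the bound role.
+        SHOW GRANTS;
+
+        -- List the databases visible to the current user.
+        SHOW DATABASES;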
+
+     **Cluster with Kerberos authentication enabled:**
+
+     **clickhouse client --host** *IP address of the ClickHouse instance* **--multiline** **--port** *ClickHouse port number* **--secure**
+
+     **Cluster with Kerberos authentication disabled:**
+
+     **clickhouse client --host** *IP address of the ClickHouse instance* **--user** *Username* **--password** **--port** 9440 **--secure**
+
+     *Enter the user password.*
+
+     .. note::
+
+        The user in normal mode (Kerberos authentication disabled) is the default user, or you can create an administrator using the open source capability provided by the ClickHouse community. You cannot use the users created on FusionInsight Manager.
+
+Granting Permissions Using the Client in Abnormal Scenarios
+-----------------------------------------------------------
+
+By default, the table metadata on each node of the ClickHouse cluster is the same. Therefore, the table information on a random ClickHouse node is collected on the permission management page of Manager. If the **ON CLUSTER** statement is not used when databases or tables are created on some nodes, the resource may fail to be displayed during permission management, and permissions may not be granted to the resource. To grant permissions on the local table on a single ClickHouse node, perform the following steps on the background client.
+
+.. note::
+
+   The following operations are performed based on the obtained roles, database or table names, and IP addresses of the node where the corresponding ClickHouseServer instance is located.
+
+   - You can log in to FusionInsight Manager and choose **Cluster** > **Services** > **ClickHouse** > **Instance** to obtain the service IP address of the ClickHouseServer instance.
+   - The default system domain name is **hadoop.com**. Log in to FusionInsight Manager and choose **System** > **Permission** > **Domain and Mutual Trust**. The value of **Local Domain** is the system domain name. Change the letters to lowercase letters when running a command.
+
+#. Log in to the node where the ClickHouseServer instance is located as user **root**.
+
+#. .. _mrs_01_24057__li10408141903516:
+
+   Run the following command to obtain the path of the **clickhouse.keytab** file:
+
+   **ls ${BIGDATA_HOME}/FusionInsight_ClickHouse_*/install/FusionInsight-ClickHouse-*/clickhouse/keytab/clickhouse.keytab**
+
+#. Log in to the node where the client is installed as the client installation user.
+
+#. Run the following command to go to the client installation directory:
+
+   **cd /opt/client**
+
+#. Run the following command to configure environment variables:
+
+   **source bigdata_env**
+
+   Run the following command if it is an MRS 3.1.0 cluster with Kerberos authentication enabled:
+
+   **export CLICKHOUSE_SECURITY_ENABLED=true**
+
+#. Run the following command to connect to the ClickHouseServer instance:
+
+   If Kerberos authentication is enabled for the current cluster, run the following command:
+
+   **clickhouse client --host** *IP address of the node where the ClickHouseServer instance is located* **--user clickhouse/hadoop.**\ *System domain name* **--password** *clickhouse.keytab path obtained in :ref:`2 `* **--port** *ClickHouse port number* **--secure**
+
+   If Kerberos authentication is disabled for the current cluster, run the following command:
+
+   **clickhouse client --host** *IP address of the node where the ClickHouseServer instance is located* **--user clickhouse** **--port** *ClickHouse port number*
+
+#. 
Run the following statement to grant permissions to a database: + + In the syntax for granting permissions, *DATABASE* indicates the name of the target database, and *role* indicates the target role. + + **GRANT** **[ON CLUSTER** *cluster_name*\ **]** *privilege* **ON** *{DATABASE|TABLE}* **TO** *{user \| role]* + + For example, grant user **testuser** the CREATE permission on database **t2**: + + **GRANT CREATE ON** *m2* **to** *testuser*\ **;** + +#. Run the following commands to grant permissions on the table or view. In the following command, *TABLE* indicates the name of the table or view to be operated, and *user* indicates the role to be operated. + + Run the following command to grant the query permission on tables in a database: + + **GRANT SELECT ON** *TABLE* **TO** *user*\ **;** + + Run the following command to grant the write permission on tables in a database: + + **GRANT INSERT ON** *TABLE* **TO** *user*\ **;** + +#. Run the following command to exit the client: + + **quit;** diff --git a/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/index.rst b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/index.rst new file mode 100644 index 0000000..f490e63 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24251.html + +.. _mrs_01_24251: + +User Management and Authentication +================================== + +- :ref:`ClickHouse User and Permission Management ` +- :ref:`Interconnecting ClickHouse With OpenLDAP for Authentication ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + clickhouse_user_and_permission_management + interconnecting_clickhouse_with_openldap_for_authentication diff --git a/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/interconnecting_clickhouse_with_openldap_for_authentication.rst b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/interconnecting_clickhouse_with_openldap_for_authentication.rst new file mode 100644 index 0000000..ac3a7ff --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/user_management_and_authentication/interconnecting_clickhouse_with_openldap_for_authentication.rst @@ -0,0 +1,183 @@ +:original_name: mrs_01_24109.html + +.. _mrs_01_24109: + +Interconnecting ClickHouse With OpenLDAP for Authentication +=========================================================== + +ClickHouse can be interconnected with OpenLDAP. You can manage accounts and permissions in a centralized manner by adding the OpenLDAP server configuration and creating users on ClickHouse. You can use this method to import users from the OpenLDAP server to ClickHouse in batches. + +This section applies only to MRS 3.1.0 or later. + +Prerequisites +------------- + +- The MRS cluster and ClickHouse instances are running properly, and the ClickHouse client has been installed. +- OpenLDAP has been installed and is running properly. + +Creating a ClickHouse User for Interconnecting with the OpenLDAP Server +----------------------------------------------------------------------- + +#. Log in to Manager and choose **Cluster** > **Services** > **ClickHouse**. Click the **Configurations** tab and then **All Configurations**. + +#. Choose **ClickHouseServer(Role)** > **Customization**, and add the following OpenLDAP configuration parameters to the **clickhouse-config-customize** configuration item. + + .. 
table:: **Table 1** OpenLDAP parameters + + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | Parameter | Description | Example Value | + +================================================+============================================================================================================================================+===========================+ + | ldap_servers.ldap_server_name.host | OpenLDAP server host name or IP address. This parameter cannot be empty. | localhost | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | ldap_servers.ldap_server_name.port | OpenLDAP server port number. | 636 | + | | | | + | | If **enable_tls** is set to **true**, the default port number is **636**. Otherwise, the default port number is **389**. | | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | ldap_servers.ldap_server_name.auth_dn_prefix | Prefix and suffix used to construct the DN to bind to. | uid= | + | | | | + | | The generated DN will be constructed as a string in the following format: **auth_dn_prefix** + **escape(user_name)** + **auth_dn_suffix**. | | + | | | | + | | Use a comma (,) as the first non-space character of **auth_dn_suffix**. | | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | ldap_servers.ldap_server_name.auth_dn_suffix | | ,ou=Group,dc=node1,dc=com | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | ldap_servers.ldap_server_name.enable_tls | A tag to trigger the use of the secure connection to the OpenLDAP server. | yes | + | | | | + | | - Set it to **no** for the plaintext (ldap://) protocol (not recommended). | | + | | - Set it to **yes** for the LDAP over SSL/TLS (ldaps://) protocol. | | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + | ldap_servers.ldap_server_name.tls_require_cert | SSL/TLS peer certificate verification behavior. | allow | + | | | | + | | The value can be **never**, **allow**, **try**, or **require**. | | + +------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------------------+ + + .. note:: + + For details about other parameters, see :ref:` Parameters `. + +#. After the configuration is complete, click **Save**. In the displayed dialog box, click **OK**. After the configuration is saved, click **Finish**. + +#. On Manager, click **Instance**, select a ClickHouseServer instance, and choose **More** > **Restart Instance**. 
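+
+   Restarting the instances makes the customized parameters take effect in the ClickHouseServer configuration. For reference, the example values in Table 1 correspond roughly to an **ldap_servers** block of the following shape in **config.xml** (an illustrative sketch only; the file generated by the service may differ):
+
+   .. code-block::
+
+      <!-- Illustrative only: shape of the LDAP server definition produced by the settings above -->
+      <ldap_servers>
+          <ldap_server_name>
+              <host>localhost</host>
+              <port>636</port>
+              <auth_dn_prefix>uid=</auth_dn_prefix>
+              <auth_dn_suffix>,ou=Group,dc=node1,dc=com</auth_dn_suffix>
+              <enable_tls>yes</enable_tls>
+              <tls_require_cert>allow</tls_require_cert>
+          </ldap_server_name>
+      </ldap_servers>
+
+   Then complete the restart as prompted:
+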
In the displayed dialog box, enter the password and click **OK**. In the displayed **Restart instance** dialog box, click **OK**. Confirm that the instance is restarted successfully as prompted and click **Finish**. + +#. Log in to the ClickHouseServer instance node and go to the **${BIGDATA_HOME}/FusionInsight_ClickHouse\_**\ *Version number*\ **/**\ *x_x*\ **\_ClickHouseServer/etc** directory. + + **cd** **${BIGDATA_HOME}/FusionInsight_ClickHouse\_\*/**\ *x_x*\ \_\ **ClickHouseServer/etc** + +#. .. _mrs_01_24109__li111911544142720: + + Run the following command to view the **config.xml** configuration file and check whether the OpenLDAP parameters are configured successfully: + + **cat config.xml** + + |image1| + +#. Log in to the node where the ClickHouseServer instance is located as user **root**. + +#. .. _mrs_01_24109__li10408141903516: + + Run the following command to obtain the path of the **clickhouse.keytab** file: + + **ls ${BIGDATA_HOME}/FusionInsight_ClickHouse_*/install/FusionInsight-ClickHouse-*/clickhouse/keytab/clickhouse.keytab** + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the ClickHouse client installation directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to connect to the ClickHouseServer instance: + + - If Kerberos authentication is enabled for the current cluster, use **clickhouse.keytab** to connect to the ClickHouseServer instance. + + **clickhouse client --host** *IP address of the node where the ClickHouseServer instance is located* **--user clickhouse/hadoop.**\ ** **--password** *clickhouse.keytab path obtained in :ref:`8 `* **--port** *ClickHouse port number* + + .. note:: + + The default system domain name is **hadoop.com**. Log in to FusionInsight Manager and choose **System** > **Permission** > **Domain and Mutual Trust**. The value of **Local Domain** is the system domain name. Change the letters to lowercase letters when running a command. + + - If Kerberos authentication is disabled for the current cluster, connect to the ClickHouseServer instance as the **clickhouse** administrator. + + **clickhouse client --host** *IP address of the node where the ClickHouseServer instance is located* **--user clickhouse** **--port** *ClickHouse port number* + +#. Create a common user of OpenLDAP. + + Run the following statement to create user **testUser** in cluster **default_cluster** and set **ldap_server** to the OpenLDAP server name in the **** tag in :ref:`6 `. In this example, the name is **ldap_server_name**. + + **CREATE USER** *testUser* **ON CLUSTER** *default_cluster* **IDENTIFIED WITH ldap_server BY '**\ ldap_server_name\ **';** + + **testUser** indicates an existing username in OpenLDAP. Change it based on the site requirements. + +#. Log out of the client, and then log in to the client as the new user to check whether the configuration is successful. + + **exit;** + + **clickhouse client --host** *IP address of the ClickHouseServer instance* **--user** *testUser* **--password** **--port** *ClickHouse port number* + + *Enter the password of testUser.* + +.. _mrs_01_24109__section16259164716419: + + Parameters +------------------------- + +- **host** + + OpenLDAP server host name or IP address. This parameter is mandatory and cannot be empty. + +- **port** + + Port number of the OpenLDAP server. If **enable_tls** is set to **true**, the default value is **636**. 
Otherwise, the value is **389**. + +- **auth_dn_prefix, auth_dn_suffix** + + Prefix and suffix used to construct the DN to bind to. + + The generated DN will be constructed as a string in the following format: **auth_dn_prefix** + **escape(user_name)** + **auth_dn_suffix**. + + Note that you should use a comma (,) as the first non-space character of **auth_dn_suffix**. + +- **enable_tls** + + A tag to trigger the use of the secure connection to the OpenLDAP server. + + Set it to **no** for the plaintext (ldap://) protocol (not recommended). + + Set it to **yes** for LDAP over SSL/TLS (ldaps://) protocol (recommended and default). + +- **tls_minimum_protocol_version** + + Minimum protocol version of SSL/TLS. + + The value can be **ssl2**, **ssl3**, **tls1.0**, **tls1.1**, or **tls1.2** (default). + +- **tls_require_cert** + + SSL/TLS peer certificate verification behavior. + + The value can be **never**, **allow**, **try**, or **require** (default). + +- **tls_cert_file** + + Certificate file. + +- **tls_key_file** + + Certificate key file. + +- **tls_ca_cert_file** + + CA certificate file. + +- **tls_ca_cert_dir** + + Directory where the CA certificate is stored. + +- **tls_cipher_suite** + + Allowed encryption suite. + +.. |image1| image:: /_static/images/en-us_image_0000001296090112.png diff --git a/doc/component-operation-guide/source/using_clickhouse/using_clickhouse_from_scratch.rst b/doc/component-operation-guide/source/using_clickhouse/using_clickhouse_from_scratch.rst new file mode 100644 index 0000000..8109293 --- /dev/null +++ b/doc/component-operation-guide/source/using_clickhouse/using_clickhouse_from_scratch.rst @@ -0,0 +1,128 @@ +:original_name: mrs_01_2345.html + +.. _mrs_01_2345: + +Using ClickHouse from Scratch +============================= + +ClickHouse is a column-based database oriented to online analysis and processing. It supports SQL query and provides good query performance. The aggregation analysis and query performance based on large and wide tables is excellent, which is one order of magnitude faster than other analytical databases. + +Prerequisites +------------- + +You have installed the client, for example, in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. Before using the client, download and update the client configuration file, and ensure that the active management node of Manager is available. + +Procedure +--------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create ClickHouse tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + a. Run the following command if it is an MRS 3.1.0 cluster: + + **export CLICKHOUSE_SECURITY_ENABLED=true** + + b. **kinit** *Component service user* + + Example: **kinit clickhouseuser** + +#. Run the client command of the ClickHouse component. + + Run the **clickhouse -h** command to view the command help of ClickHouse. + + The command output is as follows: + + .. 
code-block:: + + Use one of the following commands: + clickhouse local [args] + clickhouse client [args] + clickhouse benchmark [args] + clickhouse server [args] + clickhouse performance-test [args] + clickhouse extract-from-config [args] + clickhouse compressor [args] + clickhouse format [args] + clickhouse copier [args] + clickhouse obfuscator [args] + ... + + Run the **clickhouse client** command to connect to the ClickHouse serverif MRS 3.1.0 or later. + + - Command for using SSL to log in to a ClickHouse cluster with Kerberos authentication disabled + + **clickhouse client --host** *IP address of the ClickHouse instance*\ **--user** *Username* **--password** **--port** 9440 **--secure** + + *Enter the user password.* + + - Using SSL for login when Kerberos authentication is enabled for the current cluster: + + You must create a user on Manager because there is no default user. + + After the user authentication is successful, you do not need to carry the **--user** and **--password** parameters when logging in to the client as the authenticated user. + + **clickhouse client --host** *IP address of the ClickHouse instance* **--port** 9440 **--secure** + + The following table describes the parameters of the **clickhouse client** command. + + .. table:: **Table 1** Parameters of the **clickhouse client** command + + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+================================================================================================================================================================================================================================================================================================================================================================+ + | --host | Host name of the server. The default value is **localhost**. You can use the host name or IP address of the node where the ClickHouse instance is located. | + | | | + | | .. note:: | + | | | + | | You can log in to FusionInsight Manager and choose **Cluster** > **Services** > **ClickHouse** > **Instance** to obtain the service IP address of the ClickHouseServer instance. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --port | Port for connection. | + | | | + | | - If the SSL security connection is used, the default port number is **9440**, the parameter **--secure** must be carried. For details about the port number, search for the **tcp_port_secure** parameter in the ClickHouseServer instance configuration. | + | | - If non-SSL security connection is used, the default port number is **9000**, the parameter **--secure** does not need to be carried. For details about the port number, search for the **tcp_port** parameter in the ClickHouseServer instance configuration. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --user | Username. | + | | | + | | You can create the user on Manager and bind a role to the user. | + | | | + | | - If Kerberos authentication is enabled for the current cluster and the user authentication is successful, you do not need to carry the **--user** and **--password** parameters when logging in to the client as the authenticated user. You must create a user with this name on Manager because there is no default user in the Kerberos cluster scenario. | + | | | + | | - If Kerberos authentication is not enabled for the current cluster, you can specify a user and its password created on Manager when logging in to the client. If the user and password parameters are not carried, user **default** is used for login by default. | + | | | + | | The user in normal mode (Kerberos authentication disabled) is the default user, or you can create an administrator using the open source capability provided by the ClickHouse community. You cannot use the users created on FusionInsight Manager. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --password | Password. The default password is an empty string. This parameter is used together with the **--user** parameter. You can set a password when creating a user on Manager. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --query | Query to process when using non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --database | Current default database. The default value is **default**, which is the default configuration on the server. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --multiline | If this parameter is specified, multiline queries are allowed. (**Enter** only indicates line feed and does not indicate that the query statement is complete.) 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --multiquery | If this parameter is specified, multiple queries separated with semicolons (;) can be processed. This parameter is valid only in non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --format | Specified default format used to output the result. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --vertical | If this parameter is specified, the result is output in vertical format by default. In this format, each value is printed on a separate line, which helps to display a wide table. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --time | If this parameter is specified, the query execution time is printed to **stderr** in non-interactive mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --stacktrace | If this parameter is specified, stack trace information will be printed when an exception occurs. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --config-file | Name of the configuration file. 
| + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --secure | If this parameter is specified, the server will be connected in SSL mode. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --history_file | Path of files that record command history. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | --param_ | Query with parameters. Pass values from the client to the server. | + +-----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_dbservice/dbservice_log_overview.rst b/doc/component-operation-guide/source/using_dbservice/dbservice_log_overview.rst new file mode 100644 index 0000000..b717d29 --- /dev/null +++ b/doc/component-operation-guide/source/using_dbservice/dbservice_log_overview.rst @@ -0,0 +1,100 @@ +:original_name: mrs_01_0789.html + +.. _mrs_01_0789: + +DBService Log Overview +====================== + +Log Description +--------------- + +**Log path**: The default storage path of DBService log files is **/var/log/Bigdata/dbservice**. + +- GaussDB: **/var/log/Bigdata/dbservice/DB** (GaussDB run log directory), **/var/log/Bigdata/dbservice/scriptlog/gaussdbinstall.log** (GaussDB installation log), and **/var/log/gaussdbuninstall.log** (GaussDB uninstallation log). + +- HA: **/var/log/Bigdata/dbservice/ha/runlog** (HA run log directory) and **/var/log/Bigdata/dbservice/ha/scriptlog** (HA script log directory) + +- DBServer: **/var/log/Bigdata/dbservice/healthCheck** (Directory of service and process health check logs) + + **/var/log/Bigdata/dbservice/scriptlog** (run log directory), **/var/log/Bigdata/audit/dbservice/** (audit log directory) + +Log archive rule: The automatic DBService log compression function is enabled. By default, when the size of logs exceeds 1 MB, logs are automatically compressed into a log file named in the following format: *-[No.]*\ **.gz**. A maximum of 20 latest compressed files are reserved. + +.. note:: + + Log archive rules cannot be modified. + +.. 
table:: **Table 1** DBService log list + + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | Type | Log File Name | Description | + +=======================+============================+===================================================================================================================================+ + | DBServer run log | dbservice_serviceCheck.log | Run log file of the service check script | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | dbservice_processCheck.log | Run log file of the process check script | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | backup.log | Run logs of backup and restoration operations (The DBService backup and restoration operations need to be performed.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | checkHaStatus.log | Log file of HA check records | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | cleanupDBService.log | Uninstallation log file (You need to uninstall DBService logs.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | componentUserManager.log | Log file that records the adding and deleting operations on the database by users | + | | | | + | | | (Services that depend on DBService need to be added.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | install.log | Installation log file | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | preStartDBService.log | Pre-startup log file | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | start_dbserver.log | DBServer startup operation log file (DBService needs to be started.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | stop_dbserver.log | DBServer stop operation log file (DBService needs to be stopped.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | status_dbserver.log | Log file of the DBServer status check (You need to execute the **$DBSERVICE_HOME/sbin/status-dbserver.sh** script.) 
| + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | modifyPassword.log | Run log file of changing the DBService password script. (You need to execute the **$DBSERVICE_HOME/sbin/modifyDBPwd.sh** script.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | modifyDBPwd_yyyy-mm-dd.log | Run log file that records the DBService password change tool | + | | | | + | | | (You need to execute the **$DBSERVICE_HOME/sbin/modifyDBPwd.sh** script.) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | dbserver_switchover.log | Log for DBServer to execute the active/standby switchover script (the active/standby switchover needs to be performed) | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | GaussDB run log | gaussdb.log | Log file that records database running information | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | gs_ctl-current.log | Log file that records operations performed by using the **gs_ctl** tool | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | gs_guc-current.log | Log file that records operations, mainly parameter modification performed by using the **gs_guc** tool | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | gaussdbinstall.log | GaussDB installation log file | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | gaussdbuninstall.log | GaussDB uninstallation log file | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | HA script run log | floatip_ha.log | Log file that records the script of floating IP addresses | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | gaussDB_ha.log | Log file that records the script of GaussDB resources | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | ha_monitor.log | Log file that records the HA process monitoring information | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | send_alarm.log | Alarm sending log file | + 
+-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | | ha.log | HA run log file | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + | DBService audit log | dbservice_audit.log | Audit log file that records DBService operations, such as backup and restoration operations | + +-----------------------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------+ + +Log Format +---------- + +The following table lists the DBService log formats. + +.. table:: **Table 2** Log format + + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Format | Example | + +===========+=============================================================================================================================================================================+======================================================================================================================================================+ + | Run log | [<*yyyy-MM-dd HH:mm:ss*>] <*Log level*>: [< *Name of the script that generates the log*: *Line number* >]: < *Message in the log*> | [2020-12-19 15:56:42] INFO [postinstall.sh:653] Is cloud flag is false. (main) | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | [<*yyyy-MM-dd HH:mm:ss,SSS*>] UserName:<*Username*> UserIP:<*User IP address*> Operation:<*Operation content*> Result:<*Operation results*> Detail:<*Detailed information*> | [2020-05-26 22:00:23] UserName:omm UserIP:192.168.10.21 Operation:DBService data backup Result: SUCCESS Detail: DBService data backup is successful. | + +-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_dbservice/index.rst b/doc/component-operation-guide/source/using_dbservice/index.rst new file mode 100644 index 0000000..9d7a010 --- /dev/null +++ b/doc/component-operation-guide/source/using_dbservice/index.rst @@ -0,0 +1,14 @@ +:original_name: mrs_01_2356.html + +.. _mrs_01_2356: + +Using DBService +=============== + +- :ref:`DBService Log Overview ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + dbservice_log_overview diff --git a/doc/component-operation-guide/source/using_flink/common_flink_shell_commands.rst b/doc/component-operation-guide/source/using_flink/common_flink_shell_commands.rst new file mode 100644 index 0000000..139626a --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/common_flink_shell_commands.rst @@ -0,0 +1,178 @@ +:original_name: mrs_01_0598.html + +.. _mrs_01_0598: + +Common Flink Shell Commands +=========================== + +This section applies to MRS 3.\ *x* or later clusters. + +Before running the Flink shell commands, perform the following steps: + +#. Install the Flink client in a directory, for example, **/opt/client**. + +#. Run the following command to initialize environment variables: + + **source /opt/client/bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled, skip this step. + + **kinit** *Service user* + +#. Run the related commands according to :ref:`Table 1 `. + + .. _mrs_01_0598__table65101640171215: + + .. table:: **Table 1** Flink Shell commands + + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command | Description | Description | + +==============================================================+===========================================================================================================================================================================================================================================================+=====================================================================================================================================================================================================================================================================================+ + | yarn-session.sh | **-at,--applicationType **: Defines the Yarn application type. | Start a resident Flink cluster to receive tasks from the Flink client. | + | | | | + | | **-D **: Configures dynamic parameter. | | + | | | | + | | **-d,--detached**: Disables the interactive mode and starts a separate Flink Yarn session. | | + | | | | + | | **-h,--help**: Displays the help information about the Yarn session CLI. | | + | | | | + | | **-id,--applicationId **: Binds to a running Yarn session. | | + | | | | + | | **-j,--jar **: Sets the path of the user's JAR file. | | + | | | | + | | **-jm,--jobManagerMemory **: Sets the JobManager memory. | | + | | | | + | | **-m,--jobmanager **: Address of the JobManager (master) to which to connect. Use this parameter to connect to a specified JobManager. | | + | | | | + | | **-nl,--nodeLabel **: Specifies the nodeLabel of the Yarn application. | | + | | | | + | | **-nm,--name **: Customizes a name for the application on Yarn. | | + | | | | + | | **-q,--query**: Queries available Yarn resources. | | + | | | | + | | **-qu,--queue **: Specifies a Yarn queue. 
| | + | | | | + | | **-s,--slots **: Sets the number of slots for each TaskManager. | | + | | | | + | | **-t,--ship **: specifies the directory of the file to be sent. | | + | | | | + | | **-tm,--taskManagerMemory **: sets the TaskManager memory. | | + | | | | + | | **-yd,--yarndetached**: starts Yarn in the detached mode. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-h**: Gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink run | **-c,--class **: Specifies a class as the entry for running programs. | Submit a Flink job. | + | | | | + | | **-C,--classpath **: Specifies **classpath**. | 1. The **-y\*** parameter is used in the **yarn-cluster** mode. | + | | | | + | | **-d,--detached**: Runs a job in the detached mode. | 2. If the parameter is not **-y\***, you need to run the **yarn-session** command to start the Flink cluster before running this command to submit a task. | + | | | | + | | **-files,--dependencyFiles **: File on which the Flink program depends. | | + | | | | + | | **-n,--allowNonRestoredState**: A state that cannot be restored can be skipped during restoration from a snapshot point in time. For example, if an operator in the program is deleted, you need to add this parameter when restoring the snapshot point. | | + | | | | + | | **-m,--jobmanager **: Specifies the JobManager. | | + | | | | + | | **-p,--parallelism **: Specifies the job DOP, which will overwrite the DOP parameter in the configuration file. | | + | | | | + | | **-q,--sysoutLogging**: Disables the function of outputting Flink logs to the console. | | + | | | | + | | **-s,--fromSavepoint **: Specifies a savepoint path for recovering jobs. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-yat,--yarnapplicationType **: Defines the Yarn application type. | | + | | | | + | | **-yD **: Dynamic parameter configuration. | | + | | | | + | | **-yd,--yarndetached**: Starts Yarn in the detached mode. | | + | | | | + | | **-yh,--yarnhelp**: Obtains the Yarn help. | | + | | | | + | | **-yid,--yarnapplicationId **: Binds a job to a Yarn session. | | + | | | | + | | **-yj,--yarnjar **: Sets the path to Flink jar file. | | + | | | | + | | **-yjm,--yarnjobManagerMemory **: Sets the JobManager memory (MB). | | + | | | | + | | **-ynm,--yarnname **: Customizes a name for the application on Yarn. | | + | | | | + | | **-yq,--yarnquery**: Queries available Yarn resources (memory and CPUs). | | + | | | | + | | **-yqu,--yarnqueue **: Specifies a Yarn queue. | | + | | | | + | | **-ys,--yarnslots**: Sets the number of slots for each TaskManager. | | + | | | | + | | **-yt,--yarnship **: Specifies the path of the file to be sent. | | + | | | | + | | **-ytm,--yarntaskManagerMemory **: Sets the TaskManager memory (MB). | | + | | | | + | | **-yz,--yarnzookeeperNamespace **: Specifies the namespace of ZooKeeper. 
The value must be the same as the value of **yarn-session.sh -z**. | | + | | | | + | | **-h**: Gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink info | **-c,--class **: Specifies a class as the entry for running programs. | Display the execution plan (JSON) of the running program. | + | | | | + | | **-p,--parallelism **: Specifies the DOP for running programs. | | + | | | | + | | **-h**: Gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink list | **-a,--all**: displays all jobs. | Query running programs in the cluster. | + | | | | + | | **-m,--jobmanager **: specifies the JobManager. | | + | | | | + | | **-r,--running:** displays only jobs in the running state. | | + | | | | + | | **-s,--scheduled**: displays only jobs in the scheduled state. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-yid,--yarnapplicationId **: binds a job to a Yarn session. | | + | | | | + | | **-h**: gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink stop | **-d,--drain**: sends MAX_WATERMARK before the savepoint is triggered and the job is stopped. | Forcibly stop a running job (only streaming jobs are supported. **StoppableFunction** needs to be implemented on the source side in service code). | + | | | | + | | **-p,--savepointPath **: path for storing savepoints. The default value is **state.savepoints.dir**. | | + | | | | + | | **-m,--jobmanager **: specifies the JobManager. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-yid,--yarnapplicationId **: binds a job to a Yarn session. | | + | | | | + | | **-h**: gets help information. 
| | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink cancel | **-m,--jobmanager **: specifies the JobManager. | Cancel a running job. | + | | | | + | | **-s,--withSavepoint **: triggers a savepoint when a job is canceled. The default directory is **state.savepoints.dir**. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-yid,--yarnapplicationId **: binds a job to a Yarn session. | | + | | | | + | | **-h**: gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flink savepoint | **-d,--dispose **: specifies a directory for storing the savepoint. | Trigger a savepoint. | + | | | | + | | **-m,--jobmanager **: specifies the JobManager. | | + | | | | + | | **-z,--zookeeperNamespace **: specifies the namespace of ZooKeeper. | | + | | | | + | | **-yid,--yarnapplicationId **: binds a job to a Yarn session. | | + | | | | + | | **-h**: gets help information. | | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | **source** *Client installation directory*\ **/bigdata_env** | None | Import client environment variables. | + | | | | + | | | Restriction: If the user uses a custom script (for example, **A.sh**) and runs this command in the script, variables cannot be imported to the **A.sh** script. If variables need to be imported to the custom script **A.sh**, the user needs to use the secondary calling method. | + | | | | + | | | For example, first call the **B.sh** script in the **A.sh** script, and then run this command in the **B.sh** script. Parameters can be imported to the **A.sh** script but cannot be imported to the **B.sh** script. 
| + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | start-scala-shell.sh | local \| remote \| yarn: running mode | Start the scala shell. | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | sh generate_keystore.sh | ``-`` | Run the **generate_keystore.sh** script to generate security cookie, **flink.keystore**, and **flink.truststore**. You need to enter a user-defined password that does not contain number signs (#). | + +--------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/blob.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/blob.rst new file mode 100644 index 0000000..b2a2ccc --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/blob.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_1567.html + +.. _mrs_01_1567: + +Blob +==== + +Scenarios +--------- + +The Blob server on the JobManager node is used to receive JAR files uploaded by users on the client, send JAR files to TaskManager, and transfer log files. Flink provides some items for configuring the Blob server. You can configure them in the **flink-conf.yaml** configuration file. + +Configuration Description +------------------------- + +Users can configure the port, SSL, retry times, and concurrency. + +.. 
table:: **Table 1** Parameters + + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +========================================+================================================================================================================================================================+================+===========+ + | blob.server.port | Blob server port | 32456 to 32520 | No | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | blob.service.ssl.enabled | Indicates whether to enable the encryption for the blob transmission channel. This parameter is valid only when the global switch **security.ssl** is enabled. | true | Yes | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | blob.fetch.retries | Number of times that TaskManager tries to download blob files from JobManager. | 50 | No | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | blob.fetch.num-concurrent | Number of concurrent tasks for downloading blob files supported by JobManager. | 50 | No | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | blob.fetch.backlog | Number of blob files, such as **.jar** files, to be downloaded in the queue supported by JobManager. The unit is count. | 1000 | No | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + | library-cache-manager.cleanup.interval | Interval at which JobManager deletes the JAR files stored on the HDFS when the user cancels the Flink job. The unit is second. | 3600 | No | + +----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+-----------+ + +.. note:: + + For versions earlier than MRS 3.x, **library-cache-manager.cleanup.interval** cannot be configured. diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/configuring_parameter_paths.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/configuring_parameter_paths.rst new file mode 100644 index 0000000..df3dc8e --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/configuring_parameter_paths.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1565.html + +.. _mrs_01_1565: + +Configuring Parameter Paths +=========================== + +All parameters of Flink must be set on a client. 
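+The file uses plain *Key*\ **:** *Value* entries, one per line. The following fragment is a purely illustrative sketch: the keys are taken from the parameter tables in this guide, and the values are examples only, not recommendations::
+
+   taskmanager.heap.size: 1024mb
+   blob.fetch.retries: 50
+   akka.ask.timeout: 10s
+
+The file location and the exact format rules (including the space required after the colon) are described below.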
The path of a configuration file is as follows: *Client installation path*\ **/Flink/flink/conf/flink-conf.yaml**. + +.. note:: + + - You are advised to set the parameters in the format of *Key*\ **:** *Value* in the **flink-conf.yaml** configuration file on the client. + + Example: **taskmanager.heap.size: 1024mb** + + A space is required between *Key*\ **:** and *Value*. + + - If parameters are modified in the Flink service configuration, you need to download and install the client again after the configuration is complete. diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/distributed_coordination_via_akka.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/distributed_coordination_via_akka.rst new file mode 100644 index 0000000..8ab3382 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/distributed_coordination_via_akka.rst @@ -0,0 +1,94 @@ +:original_name: mrs_01_1568.html + +.. _mrs_01_1568: + +Distributed Coordination (via Akka) +=================================== + +Scenarios +--------- + +The Akka actor model is the basis of communications between the Flink client and JobManager, JobManager and TaskManager, as well as TaskManager and TaskManager. Flink enables you to configure the Akka connection parameters in the **flink-conf.yaml** file based on the network environment or optimization policy. + +Configuration Description +------------------------- + +You can configure timeout settings of message sending and waiting, and the Akka listening mechanism Deathwatch. + +For versions earlier than MRS 3.x, see :ref:`Table 1 `. + +.. _mrs_01_1568__table8198157172919: + +.. table:: **Table 1** Parameters + + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Mandatory | Default Value | Description | + +===============================+===========+=====================================================================+============================================================================================================================================================================================================================================================================================+ + | akka.ask.timeout | No | 10 s | Timeout duration of Akka asynchronous and block requests. If a Flink timeout failure occurs, this value can be increased. Timeout occurs when the machine processing speed is slow or the network is blocked. The unit is ms/s/m/h/d. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.lookup.timeout | No | 10 s | Timeout duration for JobManager actor object searching. The unit is ms/s/m/h/d. 
| + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.framesize | No | 10485760b | Maximum size of the message transmitted between JobManager and TaskManager. If a Flink error occurs because the message exceeds this limit, the value can be increased. The unit is b/B/KB/MB. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.watch.heartbeat.interval | No | 10 s | Heartbeat interval at which the Akka DeathWatch mechanism detects disconnected TaskManager. If TaskManager is frequently and incorrectly marked as disconnected due to heartbeat loss or delay, the value can be increased. The unit is ms/s/m/h/d. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.watch.heartbeat.pause | No | 60 s | Acceptable heartbeat pause for Akka DeathWatch mechanism. A small value indicates that irregular heartbeat is not accepted. The unit is ms/s/m/h/d. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.watch.threshold | No | 12 | DeathWatch failure detection threshold. A small value is prone to mark normal TaskManager as failed and a large value increases failure detection time. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.tcp.timeout | No | 20 s | Timeout duration of Transmission Control Protocol (TCP) connection request. If TaskManager connection timeout occurs frequently due to the network congestion, the value can be increased. The unit is ms/s/m/h/d. 
| + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.throughput | No | 15 | Number of messages processed by Akka in batches. After an operation, the processing thread is returned to the thread pool. A small value indicates the fair scheduling for actor message processing. A large value indicates improved overall performance but lowered scheduling fairness. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.log.lifecycle.events | No | false | Switch of Akka remote time logging, which can be enabled for debugging. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.startup-timeout | No | The default value is the same as the value of **akka.ask.timeout**. | Timeout duration of remote component started by Akka. The unit is ms/s/m/h/d. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | akka.ssl.enabled | Yes | true | Switch of Akka communication SSL. This parameter is valid only when the global switch **security.ssl** is enabled. | + +-------------------------------+-----------+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +For configuration items for MRS 3.x or later, see :ref:`Table 2 `. + +.. _mrs_01_1568__tf4c61b7f27244ebcaf53768dc54266f0: + +.. 
table:: **Table 2** Parameters + + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +=================================================+============================================================================================================================================================================================================================================================================================+=====================================================================+===========+ + | akka.ask.timeout | Timeout duration of Akka asynchronous and block requests. If a Flink timeout failure occurs, this value can be increased. Timeout occurs when the machine processing speed is slow or the network is blocked. The unit is ms/s/m/h/d. | 10s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.lookup.timeout | Timeout duration for JobManager actor object searching. The unit is ms/s/m/h/d. | 10s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.framesize | Maximum size of the message transmitted between JobManager and TaskManager. If a Flink error occurs because the message exceeds this limit, the value can be increased. The unit is b/B/KB/MB. | 10485760b | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.watch.heartbeat.interval | Heartbeat interval at which the Akka DeathWatch mechanism detects disconnected TaskManager. If TaskManager is frequently and incorrectly marked as disconnected due to heartbeat loss or delay, the value can be increased. The unit is ms/s/m/h/d. 
| 10s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.watch.heartbeat.pause | Acceptable heartbeat pause for Akka DeathWatch mechanism. A small value indicates that irregular heartbeat is not accepted. The unit is ms/s/m/h/d. | 60s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.watch.threshold | DeathWatch failure detection threshold. A small value may mark normal TaskManager as failed and a large value increases failure detection time. | 12 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.tcp.timeout | Timeout duration of Transmission Control Protocol (TCP) connection request. If TaskManager connection timeout occurs frequently due to the network congestion, the value can be increased. The unit is ms/s/m/h/d. | 20s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.throughput | Number of messages processed by Akka in batches. After an operation, the processing thread is returned to the thread pool. A small value indicates the fair scheduling for actor message processing. A large value indicates improved overall performance but lowered scheduling fairness. | 15 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.log.lifecycle.events | Switch of Akka remote time logging, which can be enabled for debugging. 
| false | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.startup-timeout | Timeout interval before a remote component fails to be started. The value must contain a time unit (ms/s/min/h/d). | The default value is the same as the value of **akka.ask.timeout**. | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.ssl.enabled | Switch of Akka communication SSL. This parameter is valid only when the global switch **security.ssl** is enabled. | true | Yes | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.client-socket-worker-pool.pool-size-factor | Factor that is used to determine the thread pool size. The pool size is calculated based on the following formula: ceil (available processors \* factor). The size is bounded by the **pool-size-min** and **pool-size-max** values. | 1.0 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.client-socket-worker-pool.pool-size-max | Maximum number of threads calculated based on the factor. | 2 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.client-socket-worker-pool.pool-size-min | Minimum number of threads calculated based on the factor. | 1 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.client.timeout | Timeout duration of the client. 
The value must contain a time unit (ms/s/min/h/d). | 60s | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.server-socket-worker-pool.pool-size-factor | Factor that is used to determine the thread pool size. The pool size is calculated based on the following formula: ceil (available processors \* factor). The size is bounded by the **pool-size-min** and **pool-size-max** values. | 1.0 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.server-socket-worker-pool.pool-size-max | Maximum number of threads calculated based on the factor. | 2 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ + | akka.server-socket-worker-pool.pool-size-min | Minimum number of threads calculated based on the factor. | 1 | No | + +-------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+-----------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/environment.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/environment.rst new file mode 100644 index 0000000..6588f89 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/environment.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_1576.html + +.. _mrs_01_1576: + +Environment +=========== + +Scenario +-------- + +In scenarios raising special requirements on JVM configuration, users can use configuration items to transfer JVM parameters to the client, JobManager, and TaskManager. + +Configuration +------------- + +Configuration items include JVM parameters. + +.. 
table:: **Table 1** Parameter description + + +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +===============+=========================================================================================================================================================+=================================================================================================================================================================================================================================================================================================================================================================================================================================================+===========+ + | env.java.opts | JVM parameter, which is transferred to the startup script, JobManager, TaskManager, and Yarn client. For example, transfer remote debugging parameters. | -Xloggc:/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=20M -Djdk.tls.ephemeralDHKeySize=2048 -Djava.library.path=${HADOOP_COMMON_HOME}/lib/native -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv6Addresses=false -Dbeetle.application.home.path=\ *$BIGDATA_HOME*/common/runtime/security/config | No | + +---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/file_systems.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/file_systems.rst new file mode 100644 index 0000000..4a77a58 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/file_systems.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_1572.html + +.. _mrs_01_1572: + +File Systems +============ + +Scenario +-------- + +Result files are created when tasks are running. Flink enables you to configure parameters for file creation. + +Configuration Description +------------------------- + +Configuration items include overwriting policy and directory creation. + +.. 
table:: **Table 1** Parameter description + + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+ + | Parameter | Description | Default Value | Mandatory | + +===================================+====================================================================================================================================================================================================================================+=================+=================+ + | fs.overwrite-files | Whether to overwrite the existing file by default when the file is written. | false | No | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+ + | fs.output.always-create-directory | When the degree of parallelism (DOP) of file writing programs is greater than 1, a directory is created under the output file path and different result files (one for each parallel writing program) are stored in the directory. | false | No | + | | | | | + | | - If this parameter is set to **true**, a directory is created for the writing program whose DOP is 1 and a result file is stored in the directory. | | | + | | - If this parameter is set to **false**, the file of the writing program whose DOP is 1 is created directly in the output path and no directory is created. | | | + +-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/ha.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/ha.rst new file mode 100644 index 0000000..0ee8b25 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/ha.rst @@ -0,0 +1,65 @@ +:original_name: mrs_01_1575.html + +.. _mrs_01_1575: + +HA +== + +Scenarios +--------- + +The Flink HA mode depends on ZooKeeper. Therefore, ZooKeeper-related configuration items must be set. + +Configuration Description +------------------------- + +Configuration items include the ZooKeeper address, path, and security certificate. + +.. _mrs_01_1575__ta903d6a9c6d24f72abdf46625096cd8c: + +.. 
table:: **Table 1** Parameters + + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | Parameter | Description | Default Value | Mandatory | + +=======================================================+==============================================================================================================================================================+========================================================================================+=================+ + | high-availability | Whether HA is enabled. Only the following two modes are supported currently: | zookeeper | No | + | | | | | + | | #. none: Only a single JobManager is running. The checkpoint is disabled for JobManager. | | | + | | #. ZooKeeper: | | | + | | | | | + | | - In non-Yarn mode, multiple JobManagers are supported and the leader JobManager is elected. | | | + | | - In Yarn mode, only one JobManager exists. | | | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.quorum | ZooKeeper quorum address. | Automatic configuration | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.path.root | Root directory that Flink creates on ZooKeeper, storing metadata required in HA mode. | /flink | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.storageDir | Directory for storing JobManager metadata of state backend. ZooKeeper stores only pointers to actual data. | hdfs:///flink/recovery | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.client.session-timeout | Session timeout duration on the ZooKeeper client. The unit is millisecond. | 60000 | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.client.connection-timeout | Connection timeout duration on the ZooKeeper client. The unit is millisecond. 
| 15000 | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.client.retry-wait | Waiting time between retries on the ZooKeeper client. The unit is millisecond. | 5000 | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.client.max-retry-attempts | Maximum number of retry attempts on the ZooKeeper client. | 3 | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.job.delay | Delay of job restart when JobManager recovers. | The default value is the same as the value of **akka.ask.timeout**. | No | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | high-availability.zookeeper.client.acl | ACL (open|creator) of the ZooKeeper node. For ACL options, see https://zookeeper.apache.org/doc/r3.5.1-alpha/zookeeperProgrammers.html#sc_BuiltinACLSchemes. | This parameter is configured automatically according to the cluster installation mode. | Yes | + | | | | | + | | | - Security mode: The default value is **creator**. | | + | | | - Non-security mode: The default value is **open**. | | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | zookeeper.sasl.disable | Switch for enabling or disabling Simple Authentication and Security Layer (SASL)-based authentication. | This parameter is configured automatically according to the cluster installation mode. | Yes | + | | | | | + | | | - Security mode: The default value is **false**. | | + | | | - Non-security mode: The default value is **true**. | | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + | zookeeper.sasl.service-name | - If the ZooKeeper server is configured with a service whose name is different from **ZooKeeper**, set this configuration item accordingly. | zookeeper | Yes | + | | - If service names on the client and server are inconsistent, authentication fails. 
| | | + +-------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+-----------------+ + +.. note:: + + For versions earlier than MRS 3.x, the **high-availability.job.delay** parameter is not supported. diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/index.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/index.rst new file mode 100644 index 0000000..8fc4b55 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/index.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0592.html + +.. _mrs_01_0592: + +Flink Configuration Management +============================== + +- :ref:`Configuring Parameter Paths ` +- :ref:`JobManager & TaskManager ` +- :ref:`Blob ` +- :ref:`Distributed Coordination (via Akka) ` +- :ref:`SSL ` +- :ref:`Network communication (via Netty) ` +- :ref:`JobManager Web Frontend ` +- :ref:`File Systems ` +- :ref:`State Backend ` +- :ref:`Kerberos-based Security ` +- :ref:`HA ` +- :ref:`Environment ` +- :ref:`Yarn ` +- :ref:`Pipeline ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_parameter_paths + jobmanager_&_taskmanager + blob + distributed_coordination_via_akka + ssl + network_communication_via_netty + jobmanager_web_frontend + file_systems + state_backend + kerberos-based_security + ha + environment + yarn + pipeline diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_&_taskmanager.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_&_taskmanager.rst new file mode 100644 index 0000000..76a9198 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_&_taskmanager.rst @@ -0,0 +1,138 @@ +:original_name: mrs_01_1566.html + +.. _mrs_01_1566: + +JobManager & TaskManager +======================== + +Scenarios +--------- + +JobManager and TaskManager are main components of Flink. You can configure the parameters for different security and performance scenarios on the client. + +Configuration Description +------------------------- + +Main configuration items include communication port, memory management, connection retry, and so on. + +For versions earlier than MRS 3.x, see :ref:`Table 1 `. + +.. _mrs_01_1566__table91792019191713: + +.. 
table:: **Table 1** Parameters + + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Mandatory | Default Value | Description | + +==========================================+=================+=================+==============================================================================================================================================================================================================================================================================================================+ + | taskmanager.rpc.port | No | 32326-32390 | IPC port range of TaskManager | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.data.port | No | 32391-32455 | Data exchange port range of TaskManager | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.data.ssl.enabled | No | false | Whether to enable secure sockets layer (SSL) encryption for data transfer between TaskManagers. This parameter is valid only when the global switch **security.ssl** is enabled. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.numberOfTaskSlots | No | 3 | Number of slots occupied by TaskManager. Generally, the value is configured as the number of cores of the physical machine. In **yarn-session** mode, the value can be transmitted by only the **-s** parameter. In **yarn-cluster** mode, the value can be transmitted by only the **-ys** parameter. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | parallelism.default | No | 1 | Number of concurrent job operators. 
| + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.memory.size | No | 0 | Amount of heap memory of the Java virtual machine (JVM) that TaskManager reserves for sorting, hash tables, and caching of intermediate results. If unspecified, the memory manager will take a fixed ratio with respect to the size of JVM as specified by **taskmanager.memory.fraction**. The unit is MB. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.memory.fraction | No | 0.7 | Ratio of JVM heap memory that TaskManager reserves for sorting, hash tables, and caching of intermediate results. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.memory.off-heap | Yes | false | Whether TaskManager uses off-heap memory for sorting, hash tables and intermediate status. You are advised to enable this item for large memory needs to improve memory operation efficiency. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.memory.segment-size | No | 32768 | Size of memory segment on TaskManager. Memory segment is the basic unit of the reserved memory space and is used to configure network buffer stacks. The unit is bytes. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.memory.preallocate | No | false | Whether TaskManager allocates reserved memory space upon startup. You are advised to enable this item when off-heap memory is used. 
| + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.registration.initial-backoff | No | 500 ms | Initial interval between two consecutive registration attempts. The unit is ms/s/m/h/d. | + | | | | | + | | | | .. note:: | + | | | | | + | | | | The time value and unit are separated by half-width spaces. ms/s/m/h/d indicates millisecond, second, minute, hour, and day, respectively. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskmanager.registration.refused-backoff | No | 5 min | Retry interval when a registration connection is rejected by JobManager. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | task.cancellation.interval | No | 30000 | Interval between two successive task cancellation attempts. | + +------------------------------------------+-----------------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +For configuration items for MRS 3.x or later, see :ref:`Table 2 `. + +.. _mrs_01_1566__te81a0f353e104e35876ef72d81a3d44e: + +.. 
table:: **Table 2** Parameters + + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | Parameter | Description | Default Value | Mandatory | + +======================================================+=============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+===============================================================================================================================================+=================+ + | taskmanager.rpc.port | IPC port range of TaskManager | 32326-32390 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | client.rpc.port | Akka system listening port on the Flink client. 
| 32651-32720 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.data.port | Data exchange port range of TaskManager | 32391-32455 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.data.ssl.enabled | Whether to enable secure sockets layer (SSL) encryption for data transfer between TaskManagers. This parameter is valid only when the global switch **security.ssl** is enabled. | false | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | jobmanager.heap.size | Size of the heap memory of JobManager. In **yarn-session** mode, the value can be transmitted by only the **-jm** parameter. In **yarn-cluster** mode, the value can be transmitted by only the **-yjm** parameter. If the value is smaller than **yarn.scheduler.minimum-allocation-mb** in the Yarn configuration file, the Yarn configuration value is used. Unit: B/KB/MB/GB/TB. 
| 1024mb | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.heap.size | Size of the heap memory of TaskManager. In **yarn-session** mode, the value can be transmitted by only the **-tm** parameter. In **yarn-cluster** mode, the value can be transmitted by only the **-ytm** parameter. If the value is smaller than **yarn.scheduler.minimum-allocation-mb** in the Yarn configuration file, the Yarn configuration value is used. The unit is B/KB/MB/GB/TB. | 1024mb | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.numberOfTaskSlots | Number of slots occupied by TaskManager. Generally, the value is configured as the number of cores of the physical machine. In **yarn-session** mode, the value can be transmitted by only the **-s** parameter. In **yarn-cluster** mode, the value can be transmitted by only the **-ys** parameter. 
| 1 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | parallelism.default | Default degree of parallelism, which is used for jobs for which the degree of parallelism is not specified | 1 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.numberOfBuffers | Number of TaskManager network transmission buffer stacks. If an error indicates insufficient system buffer, increase the parameter value. | 2048 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.memory.fraction | Ratio of JVM heap memory that TaskManager reserves for sorting, hash tables, and caching of intermediate results. 
| 0.7 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.memory.off-heap | Whether TaskManager uses off-heap memory for sorting, hash tables and intermediate status. You are advised to enable this item for large memory needs to improve memory operation efficiency. | false | Yes | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.memory.segment-size | Size of the memory buffer used by the memory manager and network stack The unit is bytes. | 32768 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.memory.preallocate | Whether TaskManager allocates reserved memory space upon startup. You are advised to enable this item when off-heap memory is used. 
| false | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.debug.memory.startLogThread | Enable this item for debugging Flink memory and garbage collection (GC)-related problems. TaskManager periodically collects memory and GC statistics, including the current utilization of heap and off-heap memory pools and GC time. | false | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.debug.memory.logIntervalMs | Interval at which TaskManager periodically collects memory and GC statistics. | 0 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.maxRegistrationDuration | Maximum duration of TaskManager registration on JobManager. If the actual duration exceeds the value, TaskManager is disabled. 
| 5 min | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.initial-registration-pause | Initial interval between two consecutive registration attempts. The value must contain a time unit (ms/s/min/h/d), for example, 5 seconds. | 500ms | No | + | | | | | + | | | .. note:: | | + | | | | | + | | | The time value and unit are separated by half-width spaces. ms/s/m/h/d indicates millisecond, second, minute, hour, and day, respectively. | | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.max-registration-pause | Maximum registration retry interval in case of TaskManager registration failures. The unit is ms/s/m/h/d. | 30s | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.refused-registration-pause | Retry interval when a TaskManager registration connection is rejected by JobManager. The unit is ms/s/m/h/d. 
| 10s | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | task.cancellation.interval | Interval between two successive task cancellation attempts. The unit is millisecond. | 30000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | classloader.resolve-order | Class resolution policies defined when classes are loaded from user codes, which means whether to first check the user code JAR file (**child-first**) or the application class path (**parent-first**). The default setting indicates that the class is first loaded from the user code JAR file, which means that the user code JAR file can contain and load dependencies that are different from those used by Flink. | child-first | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | slot.idle.timeout | Timeout for an idle slot in Slot Pool, in milliseconds. 
| 50000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | slot.request.timeout | Timeout for requesting a slot from Slot Pool, in milliseconds. | 300000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | task.cancellation.timeout | Timeout of task cancellation, in milliseconds. If a task cancellation times out, a fatal TaskManager error may occur. If this parameter is set to **0**, no error is reported when a task cancellation times out. | 180000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.detailed-metrics | Indicates whether to enable the detailed metrics monitoring of network queue lengths. 
| false | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.memory.buffers-per-channel | Maximum number of network buffers used by each output/input channel (sub-partition/incoming channel). In credit-based flow control mode, this indicates how much credit is in each input channel. It should be configured with at least 2 buffers to deliver good performance. One buffer is used to receive in-flight data in the sub-partition, and the other for parallel serialization. | 2 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.memory.floating-buffers-per-gate | Number of extra network buffers used by each output gate (result partition) or input gate, indicating the amount of floating credit shared among all input channels in credit-based flow control mode. Floating buffers are distributed based on the backlog feedback (real-time output buffers in sub-partitions) and can help mitigate back pressure caused by unbalanced data distribution among sub-partitions. Increase this value if the round-trip time between nodes is long and/or the number of machines in the cluster is large. | 8 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.memory.fraction | Ratio of JVM memory used for network buffers, which determines how many streaming data exchange channels a TaskManager can have at the same time and the extent of channel buffering. 
Increase this value or the values of **taskmanager.network.memory.min** and **taskmanager.network.memory.max** if the job is rejected or a warning indicating that the system does not have enough buffers is received. Note that the values of **taskmanager.network.memory.min** and **taskmanager.network.memory.max** may overwrite this value. | 0.1 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.memory.max | Maximum memory size of the network buffer. The value must contain a unit (B/KB/MB/GB/TB). | 1 GB | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.memory.min | Minimum memory size of the network buffer. The value must contain a unit (B/KB/MB/GB/TB). | 64 MB | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.request-backoff.initial | Minimum backoff for partition requests of input channels. 
| 100 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.network.request-backoff.max | Maximum backoff for partition requests of input channels. | 10000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | taskmanager.registration.timeout | Timeout for TaskManager registration. TaskManager will be terminated if it is not successfully registered within the specified time. The value must contain a time unit (ms/s/min/h/d). | 5 min | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | resourcemanager.taskmanager-timeout | Timeout interval for releasing an idle TaskManager, in milliseconds. 
| 30000 | No | + +------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_web_frontend.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_web_frontend.rst new file mode 100644 index 0000000..9ddf302 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/jobmanager_web_frontend.rst @@ -0,0 +1,110 @@ +:original_name: mrs_01_1571.html + +.. _mrs_01_1571: + +JobManager Web Frontend +======================= + +Scenarios +--------- + +When JobManager is started, the web server in the same process is also started. + +- You can access the web server to obtain information about the current Flink cluster, including information about JobManager, TaskManager, and running jobs in the cluster. +- You can configure parameters of the web server. + +Configuration Description +------------------------- + +Configuration items include the port, temporary directory, display items, error redirection, and security-related items. + +For versions earlier than MRS 3.x, see :ref:`Table 1 `. + +.. _mrs_01_1571__table812755682411: + +.. table:: **Table 1** Parameters + + +-------------------------------------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Mandatory | Default Value | Description | + +=====================================+===========+===============+========================================================================================================================+ + | jobmanager.web.port | No | 32261-32325 | Web port. Value range: 32261-32325. | + +-------------------------------------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------+ + | jobmanager.web.allow-access-address | Yes | \* | Web access whitelist. IP addresses are separated by commas (,). Only IP addresses in the whitelist can access the web. | + +-------------------------------------+-----------+---------------+------------------------------------------------------------------------------------------------------------------------+ + +For details about configuration items of MRS 3.\ *x* or later, see :ref:`Table 2 `. + +.. _mrs_01_1571__ta1b7b15d215b4d71973bd1d1fe8816f3: + +.. 
table:: **Table 2** Parameters + + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | Parameter | Description | Default Value | Mandatory | + +===================================================+==========================================================================================================================================+===============================================================================+=================+ + | flink.security.enable | When installing a Flink cluster, you are required to select **security mode** or **normal mode**. | The value is automatically configured based on the cluster installation mode. | No | + | | | | | + | | - If **security mode** is selected, the value of **flink.security.enable** is automatically set to **true**. | | | + | | - If **normal mode** is selected, the value of **flink.security.enable** is automatically set to **false**. | | | + | | | | | + | | To check whether the Flink cluster is in security mode or normal mode, view the value of **flink.security.enable**. | | | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.bind-port | Web port. Value range: 32261-32325. | 32261-32325 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.history | Number of recent jobs to be displayed. | 5 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.checkpoints.disable | Indicates whether to disable checkpoint statistics. | false | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.checkpoints.history | Number of checkpoint statistical records. | 10 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.backpressure.cleanup-interval | Interval for clearing unaccessed backpressure records. The unit is millisecond. 
| 600000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.backpressure.refresh-interval | Interval for updating backpressure records. The unit is millisecond. | 60000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.backpressure.num-samples | Number of stack tracing records for reverse pressure calculation. | 100 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.backpressure.delay-between-samples | Sampling interval for reverse pressure calculation. The unit is millisecond. | 50 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.ssl.enabled | Whether SSL encryption is enabled for web transmission. This parameter is valid only when the global switch **security.ssl** is enabled. | false | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.accesslog.enable | Switch to enable or disable web operation logs. The log is stored in **webaccess.log**. | true | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.x-frame-options | Value of the HTTP security header **X-Frame-Options**. The value can be **SAMEORIGIN**, **DENY**, or **ALLOW-FROM uri**. | DENY | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.cache-directive | Whether the web page can be cached. | no-store | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.expires-time | Expiration duration of web page cache. The unit is millisecond. 
| 0 | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.allow-access-address | Web access whitelist. IP addresses are separated by commas (,). Only IP addresses in the whitelist can access the web. | \* | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.access-control-allow-origin | Web page same-origin policy that prevents cross-domain attacks. | \* | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.refresh-interval | Web page refresh interval. The unit is millisecond. | 3000 | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.logout-timer | Automatic logout interval when no operation is performed. The unit is millisecond. | 600000 | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.403-redirect-url | Web page access error 403. If a 403 error occurs, the page switches to a specified page. | Automatic configuration | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.404-redirect-url | Web page access error 404. If a 404 error occurs, the page switches to a specified page. | Automatic configuration | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.415-redirect-url | Web page access error 415. If a 415 error occurs, the page switches to a specified page. | Automatic configuration | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | jobmanager.web.500-redirect-url | Web page access error 500. If a 500 error occurs, the page switches to a specified page.
| Automatic configuration | Yes | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.await-leader-timeout | Time the client waits for the leader address. The unit is millisecond. | 30000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.client.max-content-length | Maximum content length that the client handles (unit: bytes). | 104857600 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.connection-timeout | Maximum time for the client to establish a TCP connection (unit: ms). | 15000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.idleness-timeout | Maximum time for a connection to stay idle before failing (unit: ms). | 300000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.retry.delay | Time the client waits between retries (unit: ms). | 3000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.retry.max-attempts | Number of retries if a retriable operation fails. | 20 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.server.max-content-length | Maximum content length that the server handles (unit: bytes). | 104857600 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | rest.server.numThreads | Maximum number of threads for the asynchronous processing of requests.
| 4 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ + | web.timeout | Timeout for web monitor (unit: ms). | 10000 | No | + +---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/kerberos-based_security.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/kerberos-based_security.rst new file mode 100644 index 0000000..7546c7b --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/kerberos-based_security.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1574.html + +.. _mrs_01_1574: + +Kerberos-based Security +======================= + +Scenarios +--------- + +Flink Kerberos configuration items must be configured in security mode. + +Configuration Description +------------------------- + +The configuration items include **keytab** and **principal** of Kerberos. + +.. table:: **Table 1** Parameters + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +===================================+=================================================================================================================================================================+===============================================================+===========+ + | security.kerberos.login.keytab | Keytab file path. This parameter is a client parameter. | Configure the parameter based on actual service requirements. | Yes | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+ + | security.kerberos.login.principal | A parameter on the client. If **security.kerberos.login.keytab** and **security.kerberos.login.principal** are both set, the keytab file is used by default. | Configure the parameter based on actual service requirements. | No | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+ + | security.kerberos.login.contexts | Contexts of the JAAS file generated by Flink. This parameter is a server parameter.
| Client, KafkaClient | Yes | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------+-----------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/network_communication_via_netty.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/network_communication_via_netty.rst new file mode 100644 index 0000000..d401f10 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/network_communication_via_netty.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_1570.html + +.. _mrs_01_1570: + +Network communication (via Netty) +================================= + +Scenario +-------- + +When Flink runs a job, data transmission and reverse pressure detection between tasks depend on Netty. In certain environments, **Netty** parameters should be configured. + +Configuration Description +------------------------- + +For advanced optimization, you can modify the following Netty configuration items. The default configuration can meet the requirements of tasks of large-scale clusters with high concurrent throughput. + +.. table:: **Table 1** Parameter description + + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +====================================================+=======================================================================================================================================================================+===============+===========+ + | taskmanager.network.netty.num-arenas | Number of Netty memory blocks. | 1 | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | taskmanager.network.netty.server.numThreads | Number of Netty server threads | 1 | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | taskmanager.network.netty.client.numThreads | Number of Netty client threads | 1 | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | taskmanager.network.netty.client.connectTimeoutSec | Netty client connection timeout duration. Unit: second | 120 | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | taskmanager.network.netty.sendReceiveBufferSize | Size of Netty sending and receiving buffers. 
This defaults to the system buffer size (**cat /proc/sys/net/ipv4/tcp_[rw]mem**) and is 4 MB in modern Linux. Unit: byte | 4096 | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ + | taskmanager.network.netty.transport | Netty transport type, either **nio** or **epoll** | nio | No | + +----------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-----------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/pipeline.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/pipeline.rst new file mode 100644 index 0000000..69cb9ca --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/pipeline.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_1578.html + +.. _mrs_01_1578: + +Pipeline +======== + +Scenarios +--------- + +The Netty connection is used among multiple jobs to reduce latency. In this case, NettySink is used on the server and NettySource is used on the client for data transmission. + +This section applies to MRS 3.\ *x* or later clusters. + +Configuration Description +------------------------- + +Configuration items include NettySink information storing path, range of NettySink listening port, whether to enable SSL encryption, domain of the network used for NettySink monitoring. + +.. table:: **Table 1** Parameters + + +---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------+----------------------------------------------------------------+ + | Parameter | Description | Default Value | Mandatory | + +=============================================+=============================================================================================================================================================================+===========================================================+================================================================+ + | nettyconnector.registerserver.topic.storage | Path (on a third-party server) to information about IP address, port numbers, and concurrency of NettySink. ZooKeeper is recommended for storage. | /flink/nettyconnector | No. However, if pipeline is enabled, the feature is mandatory. | + +---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------+----------------------------------------------------------------+ + | nettyconnector.sinkserver.port.range | Port range of NettySink. | If MRS cluster is used, the default value is 28444-28843. | No. However, if pipeline is enabled, the feature is mandatory. 
| + +---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------+----------------------------------------------------------------+ + | nettyconnector.ssl.enabled | Whether SSL encryption for the communication between NettySink and NettySource is enabled. For details about the encryption key and protocol, see :ref:`SSL `. | false | No. However, if pipeline is enabled, the feature is mandatory. | + +---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------+----------------------------------------------------------------+ + | nettyconnector.message.delimiter | Delimiter used to configure the message sent by NettySink to the NettySource, which is 2-4 bytes long, and cannot contain **\\n**, **#**, or space. | The default value is **$\_**. | No. However, if pipeline is enabled, the feature is mandatory. | + +---------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------+----------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/ssl.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/ssl.rst new file mode 100644 index 0000000..0230ed1 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/ssl.rst @@ -0,0 +1,91 @@ +:original_name: mrs_01_1569.html + +.. _mrs_01_1569: + +SSL +=== + +Scenarios +--------- + +When the secure Flink cluster is required, SSL-related configuration items must be set. + +Configuration Description +------------------------- + +Configuration items include the SSL switch, certificate, password, and encryption algorithm. + +For versions earlier than MRS 3.x, see :ref:`Table 1 `. + +.. _mrs_01_1569__table956544414184: + +.. table:: **Table 1** Parameters + + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | Parameter | Mandatory | Default Value | Description | + +===========================================+=================+===================================================================================================================================+===============================================================================+ + | security.ssl.internal.enabled | Yes | The value is automatically configured according to the cluster installation mode. | Main switch of internal communication SSL. | + | | | | | + | | | - Security mode: The default value is **true**. | | + | | | - Normal mode: The default value is **false**. 
| | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.internal.keystore | Yes | ``-`` | Java keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.internal.keystore-password | Yes | ``-`` | Password used to decrypt the keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.internal.key-password | Yes | ``-`` | Password used to decrypt the server key in the keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.internal.truststore | Yes | ``-`` | **truststore** file containing the public CA certificates. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.internal.truststore-password | Yes | ``-`` | Password used to decrypt the truststore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.protocol | Yes | TLSv1.2 | SSL transmission protocol version | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.algorithms | Yes | The default value is **TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_DSS_WITH_AES_128_CBC_SHA256**. | Supported SSL standard algorithm. For details, see the Java official website. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.enabled | Yes | The value is automatically configured according to the cluster installation mode. | Main switch of external communication SSL. | + | | | | | + | | | - Security mode: The default value is **true**. | | + | | | - Normal mode: The default value is **false**. 
| | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.keystore | Yes | ``-`` | Java keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.keystore-password | Yes | ``-`` | Password used to decrypt the keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.key-password | Yes | ``-`` | Password used to decrypt the server key in the keystore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.truststore | Yes | ``-`` | **truststore** file containing the public CA certificates. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + | security.ssl.rest.truststore-password | Yes | ``-`` | Password used to decrypt the truststore file. | + +-------------------------------------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ + +For configuration items for MRS 3.x or later, see :ref:`Table 2 `. + +.. _mrs_01_1569__t0257778dfe3544959abfc85715cc5672: + +.. table:: **Table 2** Parameters + + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | Parameter | Description | Default Value | Mandatory | + +==================================+===============================================================================+=======================================================================================================================================================+=================+ + | security.ssl.enabled | Main switch of internal communication SSL. | The value is automatically configured according to the cluster installation mode. | Yes | + | | | | | + | | | - Security mode: The default value is **true**. | | + | | | - Non-security mode: The default value is **false**. 
| | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.keystore | Java keystore file. | ``-`` | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.keystore-password | Password used to decrypt the keystore file. | ``-`` | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.key-password | Password used to decrypt the server key in the keystore file. | ``-`` | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.truststore | **truststore** file containing the public CA certificates. | ``-`` | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.truststore-password | Password used to decrypt the truststore file. | ``-`` | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.protocol | SSL transmission protocol version. | TLSv1.2 | Yes | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ + | security.ssl.algorithms | Supported SSL standard algorithm. For details, see the Java official website. 
| The default value: | Yes | + | | | | | + | | | "TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" | | + +----------------------------------+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/state_backend.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/state_backend.rst new file mode 100644 index 0000000..d2c1893 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/state_backend.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_1573.html + +.. _mrs_01_1573: + +State Backend +============= + +Scenarios +--------- + +Flink enables HA and job exception, as well as job pause and recovery during version upgrade. Flink depends on state backend to store job states and on the restart strategy to restart a job. You can configure state backend and the restart strategy. + +Configuration Description +------------------------- + +Configuration items include the state backend type, storage path, and restart strategy. + +.. table:: **Table 1** Parameters + + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | Parameter | Description | Default Value | Mandatory | + +=========================================================+=====================================================================================================================================================================+================================================================================================================================================+============================+ + | state.backend.fs.checkpointdir | Path when the backend is set to **filesystem**. The path must be accessible by JobManager. Only the local mode is supported. In the cluster mode, use an HDFS path. | hdfs:///flink/checkpoints | No | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | state.savepoints.dir | Savepoint storage directory used by Flink to restore and update jobs. When a savepoint is triggered, the metadata of the savepoint is saved to this directory. 
| hdfs:///flink/savepoint | Mandatory in security mode | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy | Default restart policy, which is used for jobs for which no restart policy is specified. The options are as follows: | none | No | + | | | | | + | | - fixed-delay | | | + | | - failure-rate | | | + | | - none | | | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy.fixed-delay.attempts | Number of retry times when the fixed-delay restart strategy is used. | - If the checkpoint is enabled, the default value is the value of **Integer.MAX_VALUE**. | No | + | | | - If the checkpoint is disabled, the default value is 3. | | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy.fixed-delay.delay | Retry interval when the fixed-delay strategy is used. The unit is ms/s/m/h/d. | - If the checkpoint is enabled, the default value is 10s. | No | + | | | - If the checkpoint is disabled, the default value is the value of **akka.ask.timeout**. | | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy.failure-rate.max-failures-per-interval | Maximum number of restart times in a specified period before a job fails when the fault rate policy is used. | 1 | No | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy.failure-rate.failure-rate-interval | Retry interval when the failure-rate strategy is used. The unit is ms/s/m/h/d. 
| 60 s | No | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ + | restart-strategy.failure-rate.delay | Retry interval when the failure-rate strategy is used. The unit is ms/s/m/h/d. | The default value is the same as the value of **akka.ask.timeout**. For details, see :ref:`Distributed Coordination (via Akka) `. | No | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_configuration_management/yarn.rst b/doc/component-operation-guide/source/using_flink/flink_configuration_management/yarn.rst new file mode 100644 index 0000000..aff02f8 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_configuration_management/yarn.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_1577.html + +.. _mrs_01_1577: + +Yarn +==== + +Scenario +-------- + +Flink runs on a Yarn cluster, and JobManager runs in ApplicationMaster. Certain configuration parameters of JobManager depend on Yarn. Setting Yarn-related configuration items enables Flink to run better on Yarn. + +Configuration Description +------------------------- + +The configuration items include the memory, virtual cores, and port of the Yarn container. + +.. table:: **Table 1** Parameter description + + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +================================+===============================================================================================================================================================================================================================================================================+=======================================================+===========+ + | yarn.maximum-failed-containers | Maximum number of containers the system reallocates in case of a TaskManager container failure. The default value is the number of TaskManagers when the Flink cluster is started. | 5 | No | + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ + | yarn.application-attempts | Number of ApplicationMaster restarts. The value is the maximum value in the validity interval that is set to Akka's timeout in Flink.
After the restart, the IP address and port number of ApplicationMaster will change and you will need to connect to the client manually. | 2 | No | + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ + | yarn.heartbeat-delay | Time between heartbeats with the ApplicationMaster and Yarn ResourceManager in seconds. Unit: second | 5 | No | + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ + | yarn.containers.vcores | Number of virtual cores of each Yarn container | The default value is the number of TaskManager slots. | No | + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ + | yarn.application-master.port | ApplicationMaster port number setting. A port number range is supported. | 32586-32650 | No | + +--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------+-----------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_log_overview.rst b/doc/component-operation-guide/source/using_flink/flink_log_overview.rst new file mode 100644 index 0000000..074d74b --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_log_overview.rst @@ -0,0 +1,88 @@ +:original_name: mrs_01_0596.html + +.. _mrs_01_0596: + +Flink Log Overview +================== + +Log Description +--------------- + +**Log path:** + +- Run logs of a Flink job: **${BIGDATA_DATA_HOME}/hadoop/data${i}/nm/containerlogs/application_${appid}/container_{$contid}** + + .. note:: + + The logs of executing tasks are stored in the preceding path. After the execution is complete, the Yarn configuration determines whether these logs are gathered to the HDFS directory. + +- FlinkResource run logs: **/var/log/Bigdata/flink/flinkResource** + +**Log archive rules:** + +#. FlinkResource run logs: + + - By default, service logs are backed up each time when the log size reaches 20 MB. A maximum of 20 logs can be reserved without being compressed. + + .. note:: + + For versions earlier than MRS 3.x, The executor logs are backed up each time when the log size reaches 30 MB. A maximum of 20 logs can be reserved without being compressed. 
+ + - You can set the log size and number of compressed logs on the Manager page or modify the corresponding configuration items in **log4j-cli.properties**, **log4j.properties**, and **log4j-session.properties** in **/opt/client/Flink/flink/conf/** on the client. **/opt/client** is the client installation directory. + + .. table:: **Table 1** FlinkResource log list + + ====================== ================ ======================== + Type Name Description + ====================== ================ ======================== + FlinkResource run logs checkService.log Health check log + \ kinit.log Initialization log + \ postinstall.log Service installation log + \ prestart.log Prestart script log + \ start.log Startup log + ====================== ================ ======================== + +Log Level +--------- + +:ref:`Table 2 ` describes the log levels supported by Flink. The priorities of log levels are ERROR, WARN, INFO, and DEBUG in descending order. Logs whose levels are higher than or equal to the specified level are printed. The number of printed logs decreases as the specified log level increases. + +.. _mrs_01_0596__table63318572917: + +.. table:: **Table 2** Log levels + + ===== ============================================================= + Level Description + ===== ============================================================= + ERROR Error information about the current event processing + WARN Exception information about the current event processing + INFO Normal running status information about the system and events + DEBUG System information and system debugging information + ===== ============================================================= + +To modify log levels, perform the following steps: + +#. Go to the **All Configurations** page of Flink by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the menu bar on the left, select the log menu of the target role. +#. Select a desired log level. +#. Save the configuration. In the displayed dialog box, click **OK** to make the configurations take effect. + +.. note:: + + - After the configuration is complete, you do not need to restart the service. Download the client again for the configuration to take effect. + - You can also change the configuration items corresponding to the log level in **log4j-cli.properties**, **log4j.properties**, and **log4j-session.properties** in **/opt/client/Flink/flink/conf/** on the client. **/opt/client** is the client installation directory. + - When a job is submitted using a client, a log file is generated in the **log** folder on the client. The default umask value is **0022**. Therefore, the default log permission is **644**. To change the file permission, you need to change the umask value. For example, to change the umask value of user **omm**: + + - Add **umask 0026** to the end of the **/home/omm/.bashrc** file. + - Run the **source /home/omm/.bashrc** command to make the file permission take effect. + +Log Format +---------- + +..
table:: **Table 3** Log formats + + +---------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Format | Example | + +=========+========================================================================================================================================================+=====================================================================================================================================================================================================================================+ + | Run log | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2019-06-27 21:30:31,778 \| INFO \| [flink-akka.actor.default-dispatcher-3] \| TaskManager container_e10_1498290698388_0004_02_000007 has started. \| org.apache.flink.yarn.YarnFlinkResourceManager (FlinkResourceManager.java:368) | + +---------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/index.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/index.rst new file mode 100644 index 0000000..bba6b0e --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/index.rst @@ -0,0 +1,14 @@ +:original_name: mrs_01_0597.html + +.. _mrs_01_0597: + +Flink Performance Tuning +======================== + +- :ref:`Optimization DataStream ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + optimization_datastream/index diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_dop.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_dop.rst new file mode 100644 index 0000000..dc3e364 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_dop.rst @@ -0,0 +1,64 @@ +:original_name: mrs_01_1589.html + +.. _mrs_01_1589: + +Configuring DOP +=============== + +Scenario +-------- + +The degree of parallelism (DOP) indicates the number of tasks to be executed concurrently. It determines the number of data blocks after the operation. Configuring the DOP will optimize the number of tasks, data volume of each task, and the host processing capability. + +Query the CPU and memory usage. If data and tasks are not evenly distributed among nodes, increase the DOP for even distribution. + +Procedure +--------- + +Configure the DOP at one of the following layers (the priorities of which are in the descending order) based on the actual memory, CPU, data, and application logic conditions: + +- Operator + + Call the **setParallelism()** method to specify the DOP of an operator, data source, and sink. For example: + + .. 
code-block:: + + final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + + DataStream<String> text = [...] + DataStream<Tuple2<String, Integer>> wordCounts = text + .flatMap(new LineSplitter()) + .keyBy(0) + .timeWindow(Time.seconds(5)) + .sum(1).setParallelism(5); + + wordCounts.print(); + + env.execute("Word Count Example"); + +- Execution environment + + Flink runs in an execution environment, which defines a default DOP for operators, data sources, and data sinks. + + Call the **setParallelism()** method to specify the default DOP of the execution environment. Example: + + .. code-block:: + + final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + env.setParallelism(3); + DataStream<String> text = [...] + DataStream<Tuple2<String, Integer>> wordCounts = [...] + wordCounts.print(); + env.execute("Word Count Example"); + +- Client + + Specify the DOP when submitting jobs to Flink on the client. If you use the CLI client, specify the DOP using the **-p** parameter. Example: + + .. code-block:: + + ./bin/flink run -p 10 ../examples/*WordCount-java*.jar + +- System + + On the Flink client, modify the **parallelism.default** parameter in the **flink-conf.yaml** file in the **conf** directory to specify the DOP for all execution environments. diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_process_parameters.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_process_parameters.rst new file mode 100644 index 0000000..5e37b84 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_process_parameters.rst @@ -0,0 +1,43 @@ +:original_name: mrs_01_1590.html + +.. _mrs_01_1590: + +Configuring Process Parameters +============================== + +Scenario +-------- + +In Flink on Yarn mode, there are JobManagers and TaskManagers. JobManagers and TaskManagers schedule and run tasks. + +Therefore, configuring parameters of JobManagers and TaskManagers can optimize the execution performance of a Flink application. Perform the following steps to optimize the Flink cluster performance. + +Procedure +--------- + +#. Configure JobManager memory. + + JobManagers are responsible for task scheduling and message communications between TaskManagers and ResourceManagers. JobManager memory needs to be increased as the number of tasks and the DOP increases. + + JobManager memory needs to be configured based on the number of tasks. + + - When running the **yarn-session** command, add the **-jm MEM** parameter to configure the memory. + - When running the **yarn-cluster** command, add the **-yjm MEM** parameter to configure the memory. + +#. Configure the number of TaskManagers. + + Each core of a TaskManager can run a task at the same time. Increasing the number of TaskManagers has the same effect as increasing the DOP. Therefore, you can increase the number of TaskManagers to improve efficiency when there are sufficient resources. + +#. Configure the number of TaskManager slots. + + Multiple cores of a TaskManager can process multiple tasks at the same time. This has the same effect as increasing the DOP. However, the balance between the number of cores and the memory must be maintained, because all cores of a TaskManager share the memory. + + - When running the **yarn-session** command, add the **-s NUM** parameter to configure the number of slots.
+ - When running the **yarn-cluster** command, add the **-ys NUM** parameter to configure the number of slots. + +#. Configure TaskManager memory. + + TaskManager memory is used for task execution and communication. A large task requires more resources. In this case, you can increase the memory. + + - When running the **yarn-session** command, add the **-tm MEM** parameter to configure the memory. + - When running the **yarn-cluster** command, add the **-ytm MEM** parameter to configure the memory. diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_the_netty_network_communication.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_the_netty_network_communication.rst new file mode 100644 index 0000000..fcc517e --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/configuring_the_netty_network_communication.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_1592.html + +.. _mrs_01_1592: + +Configuring the Netty Network Communication +=========================================== + +Scenarios +--------- + +Flink communication is based on the Netty network. The network performance determines the data exchange speed and task execution efficiency. Therefore, the performance of Flink can be optimized by optimizing the Netty network. + +Procedure +--------- + +In the **conf/flink-conf.yaml** file on the client, change configurations as required. Exercise caution when changing default values, because the default values are optimal. + +- **taskmanager.network.netty.num-arenas**: specifies the number of arenas of Netty. The default value is **taskmanager.numberOfTaskSlots**. +- **taskmanager.network.netty.server.numThreads** and **taskmanager.network.netty.client.numThreads**: specify the number of threads on the client and server. The default value is **taskmanager.numberOfTaskSlots**. +- **taskmanager.network.netty.client.connectTimeoutSec**: specifies the connection timeout interval of the TaskManager client. The default value is **120s**. +- **taskmanager.network.netty.sendReceiveBufferSize**: specifies the buffer size of the Netty network. The default value is the system buffer size (cat /proc/sys/net/ipv4/tcp_[rw]mem), which is usually 4 MB. +- **taskmanager.network.netty.transport**: specifies the transmission method of the Netty network. The default value is **nio**. The value can only be **nio** or **epoll**. diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/experience_summary.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/experience_summary.rst new file mode 100644 index 0000000..81ab679 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/experience_summary.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1593.html + +.. _mrs_01_1593: + +Experience Summary +================== + +Avoiding Data Skew +------------------ + +If data skew occurs (the data volume of certain parts is extremely large), the execution time of tasks becomes inconsistent even though no GC is performed. + +- Redefine keys. Use keys of smaller granularity to optimize the task size. +- Modify the DOP. +- Call the rebalance operation to balance data partitions, as shown in the sketch after this list.
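+ +The following is a minimal sketch of the rebalance suggestion above. The input stream, the **HeavyMapper** transformation, and the parallelism value are illustrative assumptions only, not part of the product documentation: + +.. code-block:: + + // Hypothetical skewed, non-keyed input stream. + DataStream<String> skewed = [...] + + // Redistribute records round-robin so that every subtask receives a similar share, + // then raise the DOP of the hypothetical expensive operator if resources allow. + skewed + .rebalance() + .map(new HeavyMapper()) + .setParallelism(10) + .print();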
+ +Setting Timeout Interval for the Buffer +--------------------------------------- + +- During the execution of tasks, data is exchanged through network. You can set the **setBufferTimeout** parameter to specify a buffer timeout interval for data exchanging among different servers. + +- If **setBufferTimeout** is set to **-1**, the refreshing operation is performed when the buffer is full to maximize the throughput. If **setBufferTimeout** is set to **0**, the refreshing operation is performed each time data is received to minimize the delay. If **setBufferTimeout** is set to a value greater than **0**, the refreshing operation is performed after the buffer times out. + + The following is an example: + + .. code-block:: + + env.setBufferTimeout(timeoutMillis); + + env.generateSequence(1,10).map(new MyMapper()).setBufferTimeout(timeoutMillis); diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/index.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/index.rst new file mode 100644 index 0000000..e2e2110 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_1587.html + +.. _mrs_01_1587: + +Optimization DataStream +======================= + +- :ref:`Memory Configuration Optimization ` +- :ref:`Configuring DOP ` +- :ref:`Configuring Process Parameters ` +- :ref:`Optimizing the Design of Partitioning Method ` +- :ref:`Configuring the Netty Network Communication ` +- :ref:`Experience Summary ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + memory_configuration_optimization + configuring_dop + configuring_process_parameters + optimizing_the_design_of_partitioning_method + configuring_the_netty_network_communication + experience_summary diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/memory_configuration_optimization.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/memory_configuration_optimization.rst new file mode 100644 index 0000000..d396f51 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/memory_configuration_optimization.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1588.html + +.. _mrs_01_1588: + +Memory Configuration Optimization +================================= + +Scenarios +--------- + +The computing of Flink depends on memory. If the memory is insufficient, the performance of Flink will be greatly deteriorated. One solution is to monitor garbage collection (GC) to evaluate the memory usage. If the memory becomes the performance bottleneck, optimize the memory usage according to the actual situation. + +If **Full GC** is frequently reported in the Container GC on the Yarn that monitors the node processes, the GC needs to be optimized. + +.. note:: + + In the **env.java.opts** configuration item of the **conf/flink-conf.yaml** file on the client, add the **-Xloggc:/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=20M** parameter. The GC log is configured by default. + +Procedure +--------- + +- Optimize GC. + + Adjust the ratio of tenured generation memory to young generation memory. 
In the **conf/flink-conf.yaml** configuration file on the client, add the **-XX:NewRatio** parameter to the **env.java.opts** configuration item. For example, **-XX:NewRatio=2** indicates that the ratio of tenured generation memory to young generation memory is 2:1; that is, the young generation occupies one third of the heap and the tenured generation occupies two thirds. + +- When developing Flink applications, optimize the partitioning or grouping operations of DataStream. + + - If partitioning causes data skew, the partitioning method needs to be optimized. + - Do not set parallelism for operations that do not support it; for example, the WindowAll operation on DataStream does not support parallelism. + - Do not set the keyBy key to the String type. diff --git a/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/optimizing_the_design_of_partitioning_method.rst b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/optimizing_the_design_of_partitioning_method.rst new file mode 100644 index 0000000..6dda2f6 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/flink_performance_tuning/optimization_datastream/optimizing_the_design_of_partitioning_method.rst @@ -0,0 +1,66 @@ +:original_name: mrs_01_1591.html + +.. _mrs_01_1591: + +Optimizing the Design of Partitioning Method +============================================ + +Scenarios +--------- + +The division of tasks can be optimized by optimizing the partitioning method. If data skew occurs in a certain task, the whole execution process is delayed. Therefore, when designing the partitioning method, ensure that partitions are evenly assigned. + +Procedure +--------- + +Partitioning methods are as follows: + +- **Random partitioning**: randomly partitions data. + + .. code-block:: + + dataStream.shuffle(); + +- **Rebalancing (round-robin partitioning)**: evenly partitions data in round-robin mode. This partitioning method is useful for mitigating data skew. + + .. code-block:: + + dataStream.rebalance(); + +- **Rescaling**: assigns data to downstream subsets in round-robin mode. This partitioning method is useful if you want to deliver data from each parallel instance of a data source to a subset of the mappers without performing a complete rebalance, that is, without calling rebalance(). + + .. code-block:: + + dataStream.rescale(); + +- **Broadcast**: broadcasts data to all partitions. + + .. code-block:: + + dataStream.broadcast(); + +- **User-defined partitioning**: uses a user-defined partitioner to select the target task for each element. User-defined partitioning allows users to partition data based on a certain feature to optimize task execution. + + The following is an example: + + .. code-block:: + + // fromElements builds a simple Tuple2<String, Integer> stream. + DataStream<Tuple2<String, Integer>> dataStream = env.fromElements(Tuple2.of("hello",1), Tuple2.of("test",2), Tuple2.of("world",100)); + + // Defines a custom partitioner: the target partition is the string length plus the integer value, modulo the number of partitions. + Partitioner<Tuple2<String, Integer>> strPartitioner = new Partitioner<Tuple2<String, Integer>>() { + @Override + public int partition(Tuple2<String, Integer> key, int numPartitions) { + return (key.f0.length() + key.f1) % numPartitions; + } + }; + + // The Tuple2 record itself is used as the key for partitioning. 
+ + dataStream.partitionCustom(strPartitioner, new KeySelector<Tuple2<String, Integer>, Tuple2<String, Integer>>() { + @Override + public Tuple2<String, Integer> getKey(Tuple2<String, Integer> value) throws Exception { + return value; + } + }).print(); diff --git a/doc/component-operation-guide/source/using_flink/index.rst b/doc/component-operation-guide/source/using_flink/index.rst new file mode 100644 index 0000000..889a0ad --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/index.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_0591.html + +.. _mrs_01_0591: + +Using Flink +=========== + +- :ref:`Using Flink from Scratch ` +- :ref:`Viewing Flink Job Information ` +- :ref:`Flink Configuration Management ` +- :ref:`Security Configuration ` +- :ref:`Security Hardening ` +- :ref:`Security Statement ` +- :ref:`Using the Flink Web UI ` +- :ref:`Flink Log Overview ` +- :ref:`Flink Performance Tuning ` +- :ref:`Common Flink Shell Commands ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_flink_from_scratch + viewing_flink_job_information + flink_configuration_management/index + security_configuration/index + security_hardening/index + security_statement + using_the_flink_web_ui/index + flink_log_overview + flink_performance_tuning/index + common_flink_shell_commands diff --git a/doc/component-operation-guide/source/using_flink/security_configuration/configuring_kafka.rst b/doc/component-operation-guide/source/using_flink/security_configuration/configuring_kafka.rst new file mode 100644 index 0000000..a8d4910 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_configuration/configuring_kafka.rst @@ -0,0 +1,105 @@ +:original_name: mrs_01_1580.html + +.. _mrs_01_1580: + +Configuring Kafka +================= + +Sample project data of Flink is stored in Kafka. A user with Kafka permission can send data to Kafka and receive data from it. + +#. Ensure that the cluster services, including HDFS, Yarn, Flink, and Kafka, are installed. + +#. Create a topic. + + - Run the following commands in the Linux CLI to create a topic. Before running the commands, run the **kinit** command (for example, **kinit flinkuser**) for authentication. + + .. note:: + + The user used for authentication (for example, **flinkuser**) must be created in advance and must have the permission to create Kafka topics. + + The command format is as follows, in which **{zkQuorum}** indicates the ZooKeeper cluster information in the *IP*:*port* format, and **{Topic}** indicates the topic name. + + **bin/kafka-topics.sh --create --zookeeper {zkQuorum}/kafka --replication-factor 1 --partitions 5 --topic {Topic}** + + Assume that the topic name is **topic1**. The command for creating this topic is as follows: + + .. code-block:: + + /opt/client/Kafka/kafka/bin/kafka-topics.sh --create --zookeeper 10.96.101.32:2181,10.96.101.251:2181,10.96.101.177:2181,10.91.8.160:2181/kafka --replication-factor 1 --partitions 5 --topic topic1 + + - Configure the permission of the topic on the server. + + Set the **allow.everyone.if.no.acl.found** parameter of Kafka Broker to **true**. + +#. Perform security authentication. + + Kerberos authentication, SSL encryption authentication, or Kerberos + SSL authentication can be used. + + .. note:: + + For versions earlier than MRS 3.x, only Kerberos authentication is supported. + + - **Kerberos authentication** + + - Client configuration + + In the Flink configuration file **flink-conf.yaml**, add configurations about Kerberos authentication. For example, add **KafkaClient** in **contexts** as follows: + + .. 
code-block:: + + security.kerberos.login.keytab: /home/demo/keytab/flinkuser.keytab + security.kerberos.login.principal: flinkuser + security.kerberos.login.contexts: Client,KafkaClient + security.kerberos.login.use-ticket-cache: false + + .. note:: + + For versions earlier than MRS 3.x, set **security.kerberos.login.keytab** to **/home/demo/flink/release/keytab/flinkuser.keytab**. + + - Running parameter + + Running parameters about the **SASL_PLAINTEXT** protocol are as follows: + + .. code-block:: + + --topic topic1 --bootstrap.servers 10.96.101.32:21007 --security.protocol SASL_PLAINTEXT --sasl.kerberos.service.name kafka //10.96.101.32:21007 indicates the IP:port of the Kafka server. + + - **SSL encryption** + + - Configure the server. + + Log in to FusionInsight Manager, choose **Cluster** > **Services** > **Kafka** > **Configurations**, and set **Type** to **All**. Search for **ssl.mode.enable** and set it to **true**. + + - Configure the client. + + a. Log in to FusionInsight Manager, choose **Cluster > Name of the desired cluster > Services > Kafka > More > Download Client** to download Kafka client. + + b. Use the **ca.crt** certificate file in the client root directory to generate the **truststore** file for the client. + + Run the following command: + + .. code-block:: + + keytool -noprompt -import -alias myservercert -file ca.crt -keystore truststore.jks + + The command execution result is similar to the following: + + |image1| + + c. Run parameters. + + The value of **ssl.truststore.password** must be the same as the password you entered when creating **truststore**. Run the following command to run parameters: + + .. code-block:: + + --topic topic1 --bootstrap.servers 10.96.101.32:9093 --security.protocol SSL --ssl.truststore.location /home/zgd/software/FusionInsight_Kafka_ClientConfig/truststore.jks --ssl.truststore.password XXX + + - **Kerberos+SSL** **encryption** + + After completing preceding configurations of the client and server of Kerberos and SSL, modify the port number and protocol type in running parameters to enable the Kerberos+SSL encryption mode. + + .. code-block:: + + --topic topic1 --bootstrap.servers 10.96.101.32:21009 --security.protocol SASL_SSL --sasl.kerberos.service.name kafka --ssl.truststore.location /home/zgd/software/FusionInsight_Kafka_ClientConfig/truststore.jks --ssl.truststore.password XXX + +.. |image1| image:: /_static/images/en-us_image_0000001295930604.png diff --git a/doc/component-operation-guide/source/using_flink/security_configuration/configuring_pipeline.rst b/doc/component-operation-guide/source/using_flink/security_configuration/configuring_pipeline.rst new file mode 100644 index 0000000..9674064 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_configuration/configuring_pipeline.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_1581.html + +.. _mrs_01_1581: + +Configuring Pipeline +==================== + +This section applies to MRS 3.\ *x* or later clusters. + +#. Configure files. + + - **nettyconnector.registerserver.topic.storage**: (Mandatory) Configures the path (on a third-party server) to information about IP address, port numbers, and concurrency of NettySink. For example: + + .. code-block:: + + nettyconnector.registerserver.topic.storage: /flink/nettyconnector + + - **nettyconnector.sinkserver.port.range**: (Mandatory) Configures the range of port numbers of NettySink. For example: + + .. 
code-block:: + + nettyconnector.sinkserver.port.range: 28444-28843 + + - **nettyconnector.ssl.enabled**: Configures whether to enable SSL encryption between NettySink and NettySource. The default value is **false**. For example: + + .. code-block:: + + nettyconnector.ssl.enabled: true + +#. Configure security authentication. + + - SASL authentication of ZooKeeper depends on the HA configuration in the **flink-conf.yaml** file. + - SSL configurations such as keystore, truststore, keystore password, truststore password, and password inherit from **flink-conf.yaml**. For details, see :ref:`Encrypted Transmission `. diff --git a/doc/component-operation-guide/source/using_flink/security_configuration/index.rst b/doc/component-operation-guide/source/using_flink/security_configuration/index.rst new file mode 100644 index 0000000..b198aae --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_configuration/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0593.html + +.. _mrs_01_0593: + +Security Configuration +====================== + +- :ref:`Security Features ` +- :ref:`Configuring Kafka ` +- :ref:`Configuring Pipeline ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + security_features + configuring_kafka + configuring_pipeline diff --git a/doc/component-operation-guide/source/using_flink/security_configuration/security_features.rst b/doc/component-operation-guide/source/using_flink/security_configuration/security_features.rst new file mode 100644 index 0000000..3f5d270 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_configuration/security_features.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_1579.html + +.. _mrs_01_1579: + +Security Features +================= + +Security Features of Flink +-------------------------- + +- All Flink cluster components support authentication. + + - The Kerberos authentication is supported between Flink cluster components and external components, such as Yarn, HDFS, and ZooKeeper. + - The security cookie authentication between Flink cluster components, for example, Flink client and JobManager, JobManager and TaskManager, and TaskManager and TaskManager, are supported. + +- SSL encrypted transmission is supported by Flink cluster components. +- SSL encrypted transmission between Flink cluster components, for example, Flink client and JobManager, JobManager and TaskManager, and TaskManager and TaskManager, are supported. +- Following security hardening approaches for Flink web are supported: + + - Whitelist filtering. Flink web can only be accessed through Yarn proxy. + - Security header enhancement. + +- In Flink clusters, ranges of listening ports of components can be configured. +- In HA mode, ACL control is supported. diff --git a/doc/component-operation-guide/source/using_flink/security_hardening/acl_control.rst b/doc/component-operation-guide/source/using_flink/security_hardening/acl_control.rst new file mode 100644 index 0000000..e5ae425 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_hardening/acl_control.rst @@ -0,0 +1,15 @@ +:original_name: mrs_01_1584.html + +.. _mrs_01_1584: + +ACL Control +=========== + +In HA mode of Flink, ZooKeeper can be used to manage clusters and discover services. Zookeeper supports SASL ACL control. Only users who have passed the SASL (Kerberos) authentication have the permission to operate files on ZooKeeper. To enable SASL ACL control, perform following configurations in the Flink configuration file. + +.. 
code-block:: + + high-availability.zookeeper.client.acl: creator + zookeeper.sasl.disable: false + +For details about configuration items, see :ref:`Table 1 `. diff --git a/doc/component-operation-guide/source/using_flink/security_hardening/authentication_and_encryption.rst b/doc/component-operation-guide/source/using_flink/security_hardening/authentication_and_encryption.rst new file mode 100644 index 0000000..ce17ed3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_hardening/authentication_and_encryption.rst @@ -0,0 +1,281 @@ +:original_name: mrs_01_1583.html + +.. _mrs_01_1583: + +Authentication and Encryption +============================= + +Security Authentication +----------------------- + +Flink uses the following three authentication modes: + +- Kerberos authentication: It is used between the Flink Yarn client and Yarn ResourceManager, JobManager and ZooKeeper, JobManager and HDFS, TaskManager and HDFS, Kafka and TaskManager, as well as TaskManager and ZooKeeper. +- Security cookie authentication: Security cookie authentication is used between Flink Yarn client and JobManager, JobManager and TaskManager, as well as TaskManager and TaskManager. +- Internal authentication of Yarn: The Internal authentication mechanism of Yarn is used between Yarn ResourceManager and ApplicationMaster (AM). + + .. note:: + + - Flink JobManager and Yarn ApplicationMaster are in the same process. + - If Kerberos authentication is enabled for the user's cluster, Kerberos authentication is required. + - For versions earlier than MRS 3.\ *x*, Flink does not support security cookie authentication. + + .. table:: **Table 1** Authentication modes + + +---------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Authentication Mode | Description | Configuration Method | + +=================================+======================================================================+===============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Kerberos authentication | Currently, only keytab authentication mode is supported. | #. Download the user keytab from the KDC server, and place the keytab to a directory on the host of the Flink client. | + | | | #. Configure the following parameters in the **flink-conf.yaml** file: | + | | | | + | | | a. Keytab path | + | | | | + | | | .. 
code-block:: | + | | | | + | | | security.kerberos.login.keytab: /home/flinkuser/keytab/abc222.keytab | + | | | | + | | | Note: | + | | | | + | | | **/home/flinkuser/keytab/abc222.keytab** indicates the user directory. | + | | | | + | | | b. Principal name | + | | | | + | | | .. code-block:: | + | | | | + | | | security.kerberos.login.principal: abc222 | + | | | | + | | | c. In HA mode, if ZooKeeper is configured, the Kerberos authentication configuration items must be configured as follows: | + | | | | + | | | .. code-block:: | + | | | | + | | | zookeeper.sasl.disable: false | + | | | security.kerberos.login.contexts: Client | + | | | | + | | | d. If you want to perform Kerberos authentication between Kafka client and Kafka broker, set the value as follows: | + | | | | + | | | .. code-block:: | + | | | | + | | | security.kerberos.login.contexts: Client,KafkaClient | + +---------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Security cookie authentication | ``-`` | #. In the **bin** directory of the Flink client, run the **generate_keystore.sh** script to generate security cookie, **flink.keystore**, and **flink.truststore**. | + | | | | + | | | Run the **sh generate_keystore.sh** command and enter the user-defined password. The password cannot contain **#**. | + | | | | + | | | .. note:: | + | | | | + | | | After the script is executed, the **flink.keystore** and **flink.truststore** files are generated in the **conf** directory on the Flink client. In the **flink-conf.yaml** file, default values are specified for following parameters: | + | | | | + | | | - Set **security.ssl.keystore** to the absolute path of the **flink.keystore** file. | + | | | - Set **security.ssl.truststore** to the absolute path of the **flink.truststore** file. | + | | | | + | | | - Set **security.cookie** to a random password automatically generated by the **generate_keystore.sh** script. | + | | | - By default, **security.ssl.encrypt.enabled: false** is set in the **flink-conf.yaml** file by default. The **generate_keystore.sh** script sets **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password** to the password entered when the **generate_keystore.sh** script is called. | + | | | | + | | | - For MRS 3.\ *x* or later, if ciphertext is required and **security.ssl.encrypt.enabled** is set to **true** in the **flink-conf.yaml** file, the **generate_keystore.sh** script does not set **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password**. To obtain the values, use the Manager plaintext encryption API by running **curl -k -i -u** *Username*\ **:**\ *Password* **-X POST -HContent-type:application/json -d '{"plainText":"**\ *Password*\ **"}' 'https://**\ *x.x.x.x*\ **:28443/web/api/v2/tools/encrypt'**. 
| + | | | | + | | | In the preceding command, *Username*\ **:**\ *Password* indicates the user name and password for logging in to the system. The password of **"plainText"** indicates the one used to call the **generate_keystore.sh** script. *x.x.x.x* indicates the floating IP address of Manager. | + | | | | + | | | #. Set **security.enable: true** in the **flink-conf.yaml** file and check whether **security cookie** is configured successfully. Example: | + | | | | + | | | .. code-block:: | + | | | | + | | | security.cookie: ae70acc9-9795-4c48-ad35-8b5adc8071744f605d1d-2726-432e-88ae-dd39bfec40a9 | + +---------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Internal authentication of Yarn | This authentication mode does not need to be configured by the user. | ``-`` | + +---------------------------------+----------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + One Flink cluster supports only one user. One user can create multiple Flink clusters. + +.. _mrs_01_1583__section270112348585: + +Encrypted Transmission +---------------------- + +Flink uses following encrypted transmission modes: + +- Encrypted transmission inside Yarn: It is used between the Flink Yarn client and Yarn ResourceManager, as well as Yarn ResourceManager and JobManager. +- SSL transmission: SSL transmission is used between Flink Yarn client and JobManager, JobManager and TaskManager, as well as TaskManager and TaskManager. +- Encrypted transmission inside Hadoop: The internal encrypted transmission mode of Hadoop used between JobManager and HDFS, TaskManager and HDFS, JobManager and ZooKeeper, as well as TaskManager and ZooKeeper. + +.. note:: + + Configuration about SSL encrypted transmission is mandatory while configuration about encryption of Yarn and Hadoop is not required. + +To configure SSL encrypted transmission, configure the following parameters in the **flink-conf.yaml** file on the client: + +#. Enable SSL and configure the SSL encryption algorithm. For MRS 3.x or later, see :ref:`Table 2 `. Modify the parameters as required. + + .. _mrs_01_1583__table4164102001915: + + .. 
table:: **Table 2** Parameter description + + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + | Parameter | Example Value | Description | + +==============================+=====================================================================================================================================================+================================================+ + | security.ssl.enabled | true | Enable SSL. | + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + | akka.ssl.enabled | true | Enable Akka SSL. | + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + | blob.service.ssl.enabled | true | Enable SSL for the Blob channel. | + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + | taskmanager.data.ssl.enabled | true | Enable SSL transmissions between TaskManagers. | + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + | security.ssl.algorithms | TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 | Configure the SSL encryption algorithm. | + +------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------+ + + For versions earlier than MRS 3.x, see :ref:`Table 3 `. + + .. _mrs_01_1583__table483518144219: + + .. table:: **Table 3** Parameter description + + +-------------------------------+-------------------------------+------------------------------------------------+ + | Parameter | Example Value | Description | + +===============================+===============================+================================================+ + | security.ssl.internal.enabled | true | Enable internal SSL. | + +-------------------------------+-------------------------------+------------------------------------------------+ + | akka.ssl.enabled | true | Enable Akka SSL. | + +-------------------------------+-------------------------------+------------------------------------------------+ + | blob.service.ssl.enabled | true | Enable SSL for the Blob channel. | + +-------------------------------+-------------------------------+------------------------------------------------+ + | taskmanager.data.ssl.enabled | true | Enable SSL transmissions between TaskManagers. | + +-------------------------------+-------------------------------+------------------------------------------------+ + | security.ssl.algorithms | TLS_RSA_WITH_AES128CBC_SHA256 | Configure the SSL encryption algorithm. 
| + +-------------------------------+-------------------------------+------------------------------------------------+ + + For versions earlier than MRS 3.x, the following parameters in :ref:`Table 4 ` do not exist in the default Flink configuration of MRS. If you want to enable SSL for external connections, add the following parameters. After SSL for external connection is enabled, the native Flink page cannot be accessed using a Yarn proxy, because the Yarn open-source version cannot process HTTPS requests using a proxy. However, you can create a Windows VM in the same VPC of the cluster and access the native Flink page from the VM. + + .. _mrs_01_1583__table2016662031916: + + .. table:: **Table 4** Parameter description + + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Example Value | Description | + +=======================================+==========================+=========================================================================================================================================================+ + | security.ssl.rest.enabled | true | Enable external SSL. If this parameter is set to **true**, set the related parameters by referring to :ref:`Table 4 `. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore | ${path}/flink.keystore | Path for storing the **keystore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore-password | ``-`` | A user-defined password of **keystore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.key-password | ``-`` | A user-defined password of the SSL key. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore | ${path}/flink.truststore | Path for storing the **truststore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore-password | ``-`` | A user-defined password of **truststore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + Enabling SSL for data transmission between TaskManagers may pose great impact on the system performance. + +#. In the **bin** directory of the Flink client, run the **sh generate_keystore.sh** ** command. For details, see :ref:`Authentication and Encryption `. 
The configuration items in :ref:`Table 5 ` are set by default for MRS 3.\ *x* or later. You can also configure them manually. + + .. _mrs_01_1583__table5150181111227: + + .. table:: **Table 5** Parameter description + + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Example Value | Description | + +==================================+==========================+===========================================================================================================================================================+ + | security.ssl.keystore | ${path}/flink.keystore | Path for storing the **keystore**. **flink.keystore** indicates the name of the **keystore** file generated by the **generate_keystore.sh\*** tool. | + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.keystore-password | ``-`` | A user-defined password of **keystore**. | + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.key-password | ``-`` | A user-defined password of the SSL key. | + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.truststore | ${path}/flink.truststore | Path for storing the **truststore**. **flink.truststore** indicates the name of the **truststore** file generated by the **generate_keystore.sh\*** tool. | + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.truststore-password | ``-`` | A user-defined password of **truststore**. | + +----------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + + For versions earlier than MRS 3.x, the **generate_keystore.sh** command is generated automatically, and the configuration items in :ref:`Table 6 ` are set by default. You can also configure them manually. + + .. _mrs_01_1583__table93931053183719: + + .. table:: **Table 6** Parameter description + + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Example Value | Description | + +===========================================+==========================+===========================================================================================================================================================+ + | security.ssl.internal.keystore | ${path}/flink.keystore | Path for storing the **keystore**. **flink.keystore** indicates the name of the **keystore** file generated by the **generate_keystore.sh\*** tool. 
| + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.internal.keystore-password | ``-`` | A user-defined password of **keystore**. | + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.internal.key-password | ``-`` | A user-defined password of the SSL key. | + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.internal.truststore | ${path}/flink.truststore | Path for storing the **truststore**. **flink.truststore** indicates the name of the **truststore** file generated by the **generate_keystore.sh\*** tool. | + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.internal.truststore-password | ``-`` | A user-defined password of **truststore**. | + +-------------------------------------------+--------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + + For versions earlier than MRS 3.x, if SSL for external connections is enabled, that is, **security.ssl.rest.enabled** is set to **true**, you need to configure the parameters listed in :ref:`Table 7 `. + + .. _mrs_01_1583__table1615251112213: + + .. table:: **Table 7** Parameters + + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Example Value | Description | + +=======================================+==========================+=========================================================================================================================================================+ + | security.ssl.rest.enabled | true | Enable external SSL. If this parameter is set to **true**, set the related parameters by referring to :ref:`Table 7 `. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore | ${path}/flink.keystore | Path for storing the **keystore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.keystore-password | ``-`` | A user-defined password of **keystore**. 
| + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.key-password | ``-`` | A user-defined password of the SSL key. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore | ${path}/flink.truststore | Path for storing the **truststore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | security.ssl.rest.truststore-password | ``-`` | A user-defined password of **truststore**. | + +---------------------------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + The **path** directory is a user-defined directory for storing configuration files of the SSL keystore and truststore. The commands vary according to the relative path and absolute path. For details, see :ref:`3 ` and :ref:`4 `. + +#. .. _mrs_01_1583__li02291947181712: + + If the **keystore** or **truststore** file path is a relative path, the Flink client directory where the command is executed needs to access this relative path directly. Either of the following method can be used to transmit the keystore and truststore file: + + - Add **-t** option to the **CLI yarn-session.sh** command to transfer the **keystore** and **truststore** file to execution nodes. Example: + + .. code-block:: + + ./bin/yarn-session.sh -t ssl/ + + - Add **-yt** option to the **flink run** command to transfer the **keystore** and **truststore** file to execution nodes. Example: + + .. code-block:: + + ./bin/flink run -yt ssl/ -ys 3 -m yarn-cluster -c org.apache.flink.examples.java.wordcount.WordCount /opt/client/Flink/flink/examples/batch/WordCount.jar + + .. note:: + + - In the preceding example, **ssl/** is the sub-directory of the Flink client directory. It is used to store configuration files of the SSL keystore and truststore. + - The relative path of **ssl/** must be accessible from the current path where the Flink client command is run. + +#. .. _mrs_01_1583__li15533111081818: + + If the keystore or truststore file path is an absolute path, the keystore and truststore files must exist in the absolute path on Flink Client and all nodes. + + .. note:: + + For versions earlier than MRS 3.x, the user who submits the job must have the permission to read the keystore and truststore files. + + Either of the following methods can be used to execute applications. The **-t** or **-yt** option does not need to be added to transmit the **keystore** and **truststore** files. + + - Run the **CLI yarn-session.sh** command of Flink to execute applications. Example: + + .. code-block:: + + ./bin/yarn-session.sh + + - Run the **Flink run** command to execute applications. Example: + + .. 
code-block:: + + ./bin/flink run -ys 3 -m yarn-cluster -c org.apache.flink.examples.java.wordcount.WordCount /opt/client/Flink/flink/examples/batch/WordCount.jar diff --git a/doc/component-operation-guide/source/using_flink/security_hardening/index.rst b/doc/component-operation-guide/source/using_flink/security_hardening/index.rst new file mode 100644 index 0000000..ee3314b --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_hardening/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0594.html + +.. _mrs_01_0594: + +Security Hardening +================== + +- :ref:`Authentication and Encryption ` +- :ref:`ACL Control ` +- :ref:`Web Security ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + authentication_and_encryption + acl_control + web_security diff --git a/doc/component-operation-guide/source/using_flink/security_hardening/web_security.rst b/doc/component-operation-guide/source/using_flink/security_hardening/web_security.rst new file mode 100644 index 0000000..a4b4019 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_hardening/web_security.rst @@ -0,0 +1,117 @@ +:original_name: mrs_01_1585.html + +.. _mrs_01_1585: + +Web Security +============ + +Coding Specifications +--------------------- + +Note: The same coding mode is used on the web service client and server to prevent garbled characters and to enable input verification. + +Security hardening: apply UTF-8 to response messages of web server. + +Whitelist-based Filter of IP Addresses +-------------------------------------- + +Note: IP filter must be added to the web server to filter unauthorized requests from the source IP address and prevent unauthorized login. + +Security: Add **jobmanager.web.allow-access-address** to enable the IP filter. By default, only Yarn users are supported. + +.. note:: + + After the client is installed, you need to add the IP address of the client node to the **jobmanager.web.allow-access-address** configuration item. + +Preventing Sending the Absolute Paths to the Client +--------------------------------------------------- + +Note: If an absolute path is sent to a client, the directory structure of the server is exposed, increasing the risk that attackers know and attack the system. + +Security hardening: If the Flink configuration file contains a parameter starting with a slash (/), the first-level directory is deleted. + +Same-origin Policy +------------------ + +The same-source policy applies to MRS 3.x or later. + +If two URL protocols have same hosts and ports, they are of the same origin. Protocols of different origins cannot access each other, unless the source of the visitor is specified on the host of the service to be visited. + +Security hardening: The default value of the header of the response header **Access-Control-Allow-Origin** is the IP address of ResourceManager on Yarn clusters. If the IP address is not from Yarn, mutual access is not allowed. + +Preventing Sensitive Information Disclosure +------------------------------------------- + +Sensitive information disclosure prevention is applicable to MRS 3.x or later. + +Web pages containing sensitive data must not be cached, to avoid leakage of sensitive information or data crosstalk among users who visit the internet through the proxy server. + +Security hardening: Add **Cache-control**, **Pragma**, **Expires** security header. The default value is **Cache-Control: no-store**, **Pragma: no-cache**, and **Expires: 0**. 
+ +The security hardening stops contents interacted between Flink and web server from being cached. + +Anti-Hijacking +-------------- + +Anti-hijacking applies to MRS 3.x or later. + +Since hotlinking and clickjacking use framing technologies, security hardening is required to prevent attacks. + +Security hardening: Add **X-Frame-Options** security header to specify whether the browser will load the pages from **iframe**, **frame** or **object**. The default value is **X-Frame-Options: DENY**, indicating that no pages can be nested to **iframe**, **frame** or **object**. + +Logging calls of the Web Service APIs +------------------------------------- + +This function applies to MRS 3.x or later. + +Calls of the **Flink webmonitor restful** APIs are logged. + +The **jobmanager.web.accesslog.enable** can be added in the **access log**. The default value is **true**. Logs are stored in a separate **webaccess.log** file. + +Cross-Site Request Forgery Prevention +------------------------------------- + +Cross-site request forgery (CSRF) prevention applies to MRS 3.x or later. + +In **Browser/Server** applications, CSRF must be prevented for operations involving server data modification, such as adding, modifying, and deleting. The CSRF forces end users to execute non-intended operations on the current web application. + +Security hardening: Only two post APIs, one delete API, and get interfaces are reserve for modification requests. All other APIs are deleted. + +Troubleshooting +--------------- + +This function applies to MRS 3.x or later. + +When the application is abnormal, exception information is filtered, logged, and returned to the client. + +Security hardening + +- A default error message page to filter information and log detailed error information. +- Four configuration parameters are added to ensure that the error page is switched to a specified URL provided by FusionInsight, preventing exposure of unnecessary information. + + .. table:: **Table 1** Parameter description + + +---------------------------------+--------------------------------------------------------------------------------------+---------------+-----------+ + | Parameter | Description | Default Value | Mandatory | + +=================================+======================================================================================+===============+===========+ + | jobmanager.web.403-redirect-url | Web page access error 403. If 403 error occurs, the page switch to a specified page. | ``-`` | Yes | + +---------------------------------+--------------------------------------------------------------------------------------+---------------+-----------+ + | jobmanager.web.404-redirect-url | Web page access error 404. If 404 error occurs, the page switch to a specified page. | ``-`` | Yes | + +---------------------------------+--------------------------------------------------------------------------------------+---------------+-----------+ + | jobmanager.web.415-redirect-url | Web page access error 415. If 415 error occurs, the page switch to a specified page. | ``-`` | Yes | + +---------------------------------+--------------------------------------------------------------------------------------+---------------+-----------+ + | jobmanager.web.500-redirect-url | Web page access error 500. If 500 error occurs, the page switch to a specified page. 
| ``-`` | Yes | + +---------------------------------+--------------------------------------------------------------------------------------+---------------+-----------+ + +HTML5 Security +-------------- + +HTML5 security applies to MRS 3.x or later. + +HTML5 is a next generation web development specification that provides new functions and extend the labels for developers. These new labels and functions increase the attack surface and pose attack risks (such as cross-domain resource sharing, client storage, WebWorker, WebRTC, and WebSocket). + +Security hardening: Add the **Access-Control-Allow-Origin** parameter. For example, if you want to enable the cross-domain resource sharing, configure the **Access-Control-Allow-Origin** parameter of the HTTP response header. + +.. note:: + + Flink does not involve security risks of functions such as storage on the client, WebWorker, WebRTC, and WebSocket. diff --git a/doc/component-operation-guide/source/using_flink/security_statement.rst b/doc/component-operation-guide/source/using_flink/security_statement.rst new file mode 100644 index 0000000..cbbd394 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/security_statement.rst @@ -0,0 +1,11 @@ +:original_name: mrs_01_1586.html + +.. _mrs_01_1586: + +Security Statement +================== + +- All security functions of Flink are provided by the open source community or self-developed. Security features that need to be configured by users, such as authentication and SSL encrypted transmission, may affect performance. +- As a big data computing and analysis platform, Flink does not detect sensitive information. Therefore, you need to ensure that the input data is not sensitive. +- You can evaluate whether configurations are secure as required. +- For any security-related problems, contact O&M support. diff --git a/doc/component-operation-guide/source/using_flink/using_flink_from_scratch.rst b/doc/component-operation-guide/source/using_flink/using_flink_from_scratch.rst new file mode 100644 index 0000000..1334ad1 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_flink_from_scratch.rst @@ -0,0 +1,350 @@ +:original_name: mrs_01_0473.html + +.. _mrs_01_0473: + +Using Flink from Scratch +======================== + +This section describes how to use Flink to run wordcount jobs. + +Prerequisites +------------- + +- Flink has been installed in an MRS cluster. +- The cluster runs properly and the client has been correctly installed, for example, in the **/opt/hadoopclient** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. + +Using the Flink Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to initialize environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If not, skip this whole step. + + a. Prepare a user for submitting Flink jobs.. + + b. Log in to Manager and download the authentication credential. + + Log in to Manager of the cluster. For details, see :ref:`Accessing MRS Manager (Versions Earlier Than MRS 3.x) `. Choose **System Settings** > **User Management**. 
In the **Operation** column of the row that contains the added user, choose **More** > **Download Authentication Credential**. + + c. Decompress the downloaded authentication credential package and copy the **user.keytab** file to the client node, for example, to the **/opt/hadoopclient/Flink/flink/conf** directory on the client node. If the client is installed on a node outside the cluster, copy the **krb5.conf** file to the **/etc/** directory on this node. + + d. Configure security authentication by adding the **keytab** path and username in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** configuration file. + + **security.kerberos.login.keytab:** ** + + **security.kerberos.login.principal:** ** + + Example: + + security.kerberos.login.keytab: /opt/hadoopclient/Flink/flink/conf/user.keytab + + security.kerberos.login.principal: test + + e. Generate the **generate_keystore.sh** script and place it in the **bin** directory of the Flink client. In the **bin** directory of the Flink client, run the following command to perform security hardening. For details, see `Authentication and Encryption `__. Set **password** in the following command to a password for submitting jobs: + + **sh generate_keystore.sh ** + + The script automatically replaces the SSL value in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. For an MRS 2.\ *x* or earlier security cluster, external SSL is disabled by default. To enable external SSL, configure the parameter and run the script again. For details, see `Security Hardening `__. + + .. note:: + + - You do not need to manually generate the **generate_keystore.sh** script. + - After authentication and encryption, the generated **flink.keystore**, **flink.truststore**, and **security.cookie** items are automatically filled in the corresponding configuration items in **flink-conf.yaml**. + + f. Configure paths for the client to access the **flink.keystore** and **flink.truststore** files. + + - Absolute path: After the script is executed, the file path of **flink.keystore** and **flink.truststore** is automatically set to the absolute path **/opt/hadoopclient/Flink/flink/conf/** in the **flink-conf.yaml** file. In this case, you need to move the **flink.keystore** and **flink.truststore** files from the **conf** directory to this absolute path on the Flink client and Yarn nodes. + - Relative path: Perform the following steps to set the file path of **flink.keystore** and **flink.truststore** to the relative path and ensure that the directory where the Flink client command is executed can directly access the relative paths. + + #. Create a directory, for example, **ssl**, in **/opt/hadoopclient/Flink/flink/conf/**. + + **cd /opt/hadoopclient/Flink/flink/conf/** + + **mkdir ssl** + + #. Move the **flink.keystore** and **flink.truststore** files to the **/opt/hadoopclient/Flink/flink/conf/ssl/** directory. + + **mv flink.keystore ssl/** + + **mv flink.truststore ssl/** + + #. Change the values of the following parameters to relative paths in the **flink-conf.yaml** file: + + .. code-block:: + + security.ssl.internal.keystore: ssl/flink.keystore + security.ssl.internal.truststore: ssl/flink.truststore + +#. Run a wordcount job. + + .. important:: + + To submit or run jobs on Flink, the user must have the following permissions: + + - If Ranger authentication is enabled, the current user must belong to the **hadoop** group or the user has been granted the **/flink** read and write permissions in Ranger. 
+ - If Ranger authentication is disabled, the current user must belong to the **hadoop** group. + + - Normal cluster (Kerberos authentication disabled) + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Security cluster (Kerberos authentication enabled) + + - If the **flink.keystore** and **flink.truststore** file are stored in the absolute path: + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - If the **flink.keystore** and **flink.truststore** files are stored in the relative path: + + - In the same directory of SSL, run the following commands to start a session and submit jobs in the session. The SSL directory is a relative path. For example, if the SSL directory is **opt/hadoopclient/Flink/flink/conf/**, then run the following commands in this directory: + + **yarn-session.sh -t ssl/ -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster -yt ssl/ /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + +#. After the job has been successfully submitted, the following information is displayed on the client: + + + .. figure:: /_static/images/en-us_image_0000001349289933.png + :alt: **Figure 1** Job submitted successfully on Yarn + + **Figure 1** Job submitted successfully on Yarn + + + .. figure:: /_static/images/en-us_image_0000001349289937.png + :alt: **Figure 2** Session started successfully + + **Figure 2** Session started successfully + + + .. figure:: /_static/images/en-us_image_0000001295930780.png + :alt: **Figure 3** Job submitted successfully in the session + + **Figure 3** Job submitted successfully in the session + +#. Go to the native YARN service page, find the application of the job, and click the application name to go to the job details page. For details, see `Viewing Flink Job Information `__. + + - If the job is not completed, click **Tracking URL** to go to the native Flink page and view the job running information. + + - If the job submitted in a session has been completed, you can click **Tracking URL** to log in to the native Flink service page to view job information. + + + .. figure:: /_static/images/en-us_image_0000001439150893.png + :alt: **Figure 4** Application + + **Figure 4** Application + +Using the Flink Client (MRS 3.x or Later) +----------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to initialize environment variables: + + **source /opt/hadoopclient/bigdata_env** + +#. If Kerberos authentication is enabled for the cluster, perform the following steps. If not, skip this whole step. + + a. Prepare a user for submitting Flink jobs. 
+ + b. Log in to Manager and download the authentication credential. + + Log in to Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **System** > **Permission** > **Manage User**. On the displayed page, locate the row that contains the added user, click **More** in the **Operation** column, and select **Download authentication credential**. + + c. Decompress the downloaded authentication credential package and copy the **user.keytab** file to the client node, for example, to the **/opt/hadoopclient/Flink/flink/conf** directory on the client node. If the client is installed on a node outside the cluster, copy the **krb5.conf** file to the **/etc/** directory on this node. + + d. Append the service IP address of the node where the client is installed, the floating IP address of Manager, and the IP address of the master node to the **jobmanager.web.access-control-allow-origin** and **jobmanager.web.allow-access-address** configuration items in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. Use commas (,) to separate IP addresses. + + .. code-block:: + + jobmanager.web.access-control-allow-origin: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx + jobmanager.web.allow-access-address: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx + + .. note:: + + - To obtain the service IP address of the node where the client is installed, perform the following operations: + + - Node inside the cluster: + + In the navigation tree of the MRS management console, choose **Clusters > Active Clusters**, select a cluster, and click its name to switch to the cluster details page. + + On the **Nodes** tab page, view the IP address of the node where the client is installed. + + - Node outside the cluster: IP address of the ECS where the client is installed. + + - To obtain the floating IP address of Manager, perform the following operations: + + - In the navigation tree of the MRS management console, choose **Clusters > Active Clusters**, select a cluster, and click its name to switch to the cluster details page. + + On the **Nodes** tab page, view the **Name**. The node that contains **master1** in its name is the Master1 node. The node that contains **master2** in its name is the Master2 node. + + - Log in to the Master2 node remotely, and run the **ifconfig** command. In the command output, **eth0:wsom** indicates the floating IP address of MRS Manager. Record the value of **inet**. If the floating IP address of MRS Manager cannot be queried on the Master2 node, switch to the Master1 node to query and record the floating IP address. If there is only one Master node, query and record the cluster manager IP address of the Master node. + + e. Configure security authentication by adding the **keytab** path and username in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** configuration file. + + **security.kerberos.login.keytab:** *<user.keytab file path>* + + **security.kerberos.login.principal:** *<username>* + + Example: + + security.kerberos.login.keytab: /opt/hadoopclient/Flink/flink/conf/user.keytab + + security.kerberos.login.principal: test + + f. Generate the **generate_keystore.sh** script and place it in the **bin** directory of the Flink client. In the **bin** directory of the Flink client, run the following command to perform security hardening. For details, see `Authentication and Encryption `__.
Set **password** in the following command to the password to be used for submitting jobs: + + **sh generate_keystore.sh** *<password>* + + The script automatically replaces the SSL-related values in the **/opt/hadoopclient/Flink/flink/conf/flink-conf.yaml** file. + + .. note:: + + After authentication and encryption, the **flink.keystore** and **flink.truststore** files are generated in the **conf** directory on the Flink client and the following configuration items are set to the default values in the **flink-conf.yaml** file: + + - Set **security.ssl.keystore** to the absolute path of the **flink.keystore** file. + - Set **security.ssl.truststore** to the absolute path of the **flink.truststore** file. + + - Set **security.cookie** to a random password automatically generated by the **generate_keystore.sh** script. + - By default, **security.ssl.encrypt.enabled** is set to **false** in the **flink-conf.yaml** file. In this case, the **generate_keystore.sh** script sets **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password** to the password entered when the **generate_keystore.sh** script is called. + + - For MRS 3.\ *x* or later, if ciphertext is required and **security.ssl.encrypt.enabled** is set to **true** in the **flink-conf.yaml** file, the **generate_keystore.sh** script does not set **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password**. To obtain the values, use the Manager plaintext encryption API by running **curl -k -i -u** *Username*\ **:**\ *Password* **-X POST -HContent-type:application/json -d '{"plainText":"**\ *Password*\ **"}' 'https://**\ *x.x.x.x*\ **:28443/web/api/v2/tools/encrypt'**. + + In the preceding command, *Username*\ **:**\ *Password* indicates the username and password for logging in to the system. The password in **"plainText"** is the one entered when the **generate_keystore.sh** script is called. *x.x.x.x* indicates the floating IP address of Manager. + + g. Configure paths for the client to access the **flink.keystore** and **flink.truststore** files. + + - Absolute path: After the script is executed, the file paths of **flink.keystore** and **flink.truststore** are automatically set to the absolute path **/opt/hadoopclient/Flink/flink/conf/** in the **flink-conf.yaml** file. In this case, move the **flink.keystore** and **flink.truststore** files from the **conf** directory to this absolute path on the Flink client and Yarn nodes. + - Relative path: Perform the following steps to set the file paths of **flink.keystore** and **flink.truststore** to relative paths and ensure that the directory where the Flink client command is executed can directly access these relative paths. + + #. Create a directory, for example, **ssl**, in **/opt/hadoopclient/Flink/flink/conf/**. + + **cd /opt/hadoopclient/Flink/flink/conf/** + + **mkdir ssl** + + #. Move the **flink.keystore** and **flink.truststore** files to the **/opt/hadoopclient/Flink/flink/conf/ssl/** directory. + + **mv flink.keystore ssl/** + + **mv flink.truststore ssl/** + + #. Change the values of the following parameters to relative paths in the **flink-conf.yaml** file: + + .. code-block:: + + security.ssl.keystore: ssl/flink.keystore + security.ssl.truststore: ssl/flink.truststore
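+
+         For reference, after the preceding configuration the security-related entries in the **flink-conf.yaml** file look similar to the following. This is only a consolidated sketch based on the example values used in this section (principal **test**, keytab **/opt/hadoopclient/Flink/flink/conf/user.keytab**, and the relative **ssl/** path); replace them with the values of your own environment. The **security.ssl.key-password**, **security.ssl.keystore-password**, and **security.ssl.truststore-password** items are filled in by the **generate_keystore.sh** script and are therefore omitted here.
+
+         .. code-block::
+
+            security.kerberos.login.keytab: /opt/hadoopclient/Flink/flink/conf/user.keytab
+            security.kerberos.login.principal: test
+            security.ssl.keystore: ssl/flink.keystore
+            security.ssl.truststore: ssl/flink.truststore
+            jobmanager.web.access-control-allow-origin: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx
+            jobmanager.web.allow-access-address: xx.xx.xxx.xxx,xx.xx.xxx.xxx,xx.xx.xxx.xxx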
+ +#. Run a wordcount job. + + .. important:: + + To submit or run jobs on Flink, the user must have the following permissions: + + - If Ranger authentication is enabled, the current user must belong to the **hadoop** group or the user has been granted the **/flink** read and write permissions in Ranger. + - If Ranger authentication is disabled, the current user must belong to the **hadoop** group. + + - Normal cluster (Kerberos authentication disabled) + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Security cluster (Kerberos authentication enabled) + + - If the **flink.keystore** and **flink.truststore** files are stored in an absolute path: + + - Run the following commands to start a session and submit a job in the session: + + **yarn-session.sh -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - If the **flink.keystore** and **flink.truststore** files are stored in a relative path: + + - Run the following commands to start a session and submit jobs in the session. Run them in the directory that contains the **ssl** directory so that the relative path **ssl/** can be resolved, for example, in **/opt/hadoopclient/Flink/flink/conf/**: + + **yarn-session.sh -t ssl/ -nm "**\ *session-name*\ **"** + + **flink run /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + + - Run the following command to submit a single job on Yarn: + + **flink run -m yarn-cluster -yt ssl/ /opt/hadoopclient/Flink/flink/examples/streaming/WordCount.jar** + +#. After the job has been successfully submitted, the following information is displayed on the client: + + + .. figure:: /_static/images/en-us_image_0000001349090457.png + :alt: **Figure 5** Job submitted successfully on Yarn + + **Figure 5** Job submitted successfully on Yarn + + + .. figure:: /_static/images/en-us_image_0000001349170353.png + :alt: **Figure 6** Session started successfully + + **Figure 6** Session started successfully + + + .. figure:: /_static/images/en-us_image_0000001348770649.png + :alt: **Figure 7** Job submitted successfully in the session + + **Figure 7** Job submitted successfully in the session + +#. Go to the native YARN service page, find the application of the job, and click the application name to go to the job details page. For details, see `Viewing Flink Job Information `__. + + - If the job is not completed, click **Tracking URL** to go to the native Flink page and view the job running information. + + - If the job submitted in a session has been completed, you can click **Tracking URL** to log in to the native Flink service page to view job information. + + + .. 
figure:: /_static/images/en-us_image_0000001438951649.png + :alt: **Figure 8** Application + + **Figure 8** Application diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/accessing_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/accessing_the_flink_web_ui.rst new file mode 100644 index 0000000..75d5150 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/accessing_the_flink_web_ui.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_24019.html + +.. _mrs_01_24019: + +Accessing the Flink Web UI +========================== + +Scenario +-------- + +After Flink is installed in an MRS cluster, you can connect to clusters and data as well as manage stream tables and jobs using the Flink web UI. + +This section describes how to access the Flink web UI in an MRS cluster. + +.. note:: + + You are advised to use Google Chrome 50 or later to access the Flink web UI. The Internet Explorer may be incompatible with the Flink web UI. + +Impact on the System +-------------------- + +Site trust must be added to the browser when you access Manager and the Flink web UI for the first time. Otherwise, the Flink web UI cannot be accessed. + +Procedure +--------- + +#. Log in to FusionInsight Manager as a user with **FlinkServer Admin Privilege**. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > **Services** > **Flink**. + +#. On the right of **Flink WebUI**, click the link to access the Flink web UI. + + The Flink web UI provides the following functions: + + - System management: + + - Cluster connection management allows you to create, view, edit, test, and delete a cluster connection. + - Data connection management allows you to create, view, edit, test, and delete a data connection. Data connection types include HDFS and Kafka. + - Application management allows you to create, view, and delete an application. + + - Stream table management allows you to create, view, edit, and delete a stream table. + - Job management allows you to create, view, start, develop, edit, stop, and delete a job. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_cluster_connection_on_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_cluster_connection_on_the_flink_web_ui.rst new file mode 100644 index 0000000..3eb927c --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_cluster_connection_on_the_flink_web_ui.rst @@ -0,0 +1,89 @@ +:original_name: mrs_01_24021.html + +.. _mrs_01_24021: + +Creating a Cluster Connection on the Flink Web UI +================================================= + +Scenario +-------- + +Different clusters can be accessed by configuring the cluster connection. + +Creating a Cluster Connection +----------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. + +#. Choose **System Management** > **Cluster Connection Management**. The **Cluster Connection Management** page is displayed. + +#. Click **Create Cluster Connection**. On the displayed page, set parameters by referring to :ref:`Table 1 ` and click **OK**. + + .. _mrs_01_24021__table134890201518: + + .. 
table:: **Table 1** Parameters for creating a cluster connection + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+============================================================================================================================================================================================+ + | Cluster Connection Name | Name of the cluster connection, which can contain a maximum of 100 characters. Only letters, digits, and underscores (_) are allowed. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Description of the cluster connection name. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FusionInsight HD Version | Set a cluster version. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Secure Version | - If the secure version is used, select **Yes** for a security cluster. Enter the username and upload the user credential. | + | | - If not, select **No**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Username | The user must have the minimum permissions for accessing services in the cluster. The name can contain a maximum of 100 characters. Only letters, digits, and underscores (_) are allowed. | + | | | + | | This parameter is available only when **Secure Version** is set to **Yes**. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Client Profile | Client profile of the cluster, in TAR format. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | User Credential | User authentication credential in FusionInsight Manager in TAR format. | + | | | + | | This parameter is available only when **Secure Version** is set to **Yes**. | + | | | + | | Files can be uploaded only after the username is entered. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + To obtain the cluster client configuration files, perform the following steps: + + a. Log in to FusionInsight Manager and choose **Cluster** > **Dashboard**. + b. Choose **More** > **Download Client** > **Configuration Files Only**, select a platform type, and click **OK**. 
+ + To obtain the user credential, perform the following steps: + + a. Log in to FusionInsight Manager and click **System**. + b. In the **Operation** column of the user, choose **More** > **Download Authentication Credential**, select a cluster, and click **OK**. + +Editing a Cluster Connection +---------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Cluster Connection Management**. The **Cluster Connection Management** page is displayed. +#. In the **Operation** column of the item to be modified, click **Edit**. On the displayed page, modify the connection information by referring to :ref:`Table 1 ` and click **OK**. + +Testing a Cluster Connection +---------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Cluster Connection Management**. The **Cluster Connection Management** page is displayed. +#. In the **Operation** column of the item to be tested, click **Test**. + +Searching for a Cluster Connection +---------------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Cluster Connection Management**. The **Cluster Connection Management** page is displayed. +#. In the upper right corner of the page, you can enter a search criterion to search for and view the cluster connection based on **Cluster Connection Name**. + +Deleting a Cluster Connection +----------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Cluster Connection Management**. The **Cluster Connection Management** page is displayed. +#. In the **Operation** column of the item to be deleted, click **Delete**, and click **OK** in the displayed page. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_data_connection_on_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_data_connection_on_the_flink_web_ui.rst new file mode 100644 index 0000000..82269e3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_a_data_connection_on_the_flink_web_ui.rst @@ -0,0 +1,69 @@ +:original_name: mrs_01_24022.html + +.. _mrs_01_24022: + +Creating a Data Connection on the Flink Web UI +============================================== + +Scenario +-------- + +You can use data connections to access different data services. Currently, FlinkServer supports HDFS and Kafka data connections. + +Creating a Data Connection +-------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. + +#. Choose **System Management** > **Data Connection Management**. The **Data Connection Management** page is displayed. + +#. Click **Create Data Connection**. On the displayed page, select a data connection type, enter information by referring to :ref:`Table 1 `, and click **OK**. + + .. _mrs_01_24022__table134890201518: + + .. 
table:: **Table 1** Parameters for creating a data connection + + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Parameter | Description | Example Value | + +=======================+=============================================================================================================================================================+=====================================+ + | Data Connection Type | Type of the data connection, which can be **HDFS** or **Kafka**. | ``-`` | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Data Connection Name | Name of the data connection, which can contain a maximum of 100 characters. Only letters, digits, and underscores (_) are allowed. | ``-`` | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Cluster Connection | Cluster connection name in configuration management. | ``-`` | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Kafka broker | Connection information about Kafka broker instances. The format is *IP address*:*Port number*. Use commas (,) to separate multiple instances. | 192.168.0.1:21005,192.168.0.2:21005 | + | | | | + | | This parameter is mandatory for Kafka data connections. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Authentication Mode | - **SIMPLE**: indicates that the connected service is in non-security mode and does not need to be authenticated. | ``-`` | + | | - **KERBEROS**: indicates that the connected service is in security mode and the Kerberos protocol for security authentication is used for authentication. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + +Editing a Data Connection +------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Data Connection Management**. The **Data Connection Management** page is displayed. +#. In the **Operation** column of the item to be modified, click **Edit**. On the displayed page, modify the connection information by referring to :ref:`Table 1 ` and click **OK**. + +Testing a Data Connection +------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Data Connection Management**. The **Data Connection Management** page is displayed. +#. In the **Operation** column of the item to be tested, click **Test**. + +Searching for a Data Connection +------------------------------- + +#. Access the Flink web UI. 
For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Data Connection Management**. The **Data Connection Management** page is displayed. +#. In the upper right corner of the page, you can search for a data connection by name. + +Deleting a Data Connection +-------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Choose **System Management** > **Data Connection Management**. The **Data Connection Management** page is displayed. +#. In the **Operation** column of the item to be deleted, click **Delete**, and click **OK** in the displayed page. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_an_application_on_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_an_application_on_the_flink_web_ui.rst new file mode 100644 index 0000000..f84ad16 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/creating_an_application_on_the_flink_web_ui.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_24020.html + +.. _mrs_01_24020: + +Creating an Application on the Flink Web UI +=========================================== + +Scenario +-------- + +Applications can be used to isolate different upper-layer services. + +Creating an Application +----------------------- + +#. Access the Flink web UI as a user with **FlinkServer Admin Privilege**. For details, see :ref:`Accessing the Flink Web UI `. + +#. Choose **System Management** > **Application Management**. + +#. Click **Create Application**. On the displayed page, set parameters by referring to :ref:`Table 1 ` and click **OK**. + + .. _mrs_01_24020__table2048293612324: + + .. table:: **Table 1** Parameters for creating an application + + +-------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=============+================================================================================================================================================+ + | Application | Name of the application to be created. The name can contain a maximum of 32 characters. Only letters, digits, and underscores (_) are allowed. | + +-------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Description of the application to be created. The value can contain a maximum of 85 characters. | + +-------------+------------------------------------------------------------------------------------------------------------------------------------------------+ + + After the application is created, you can switch to the application to be operated in the upper left corner of the Flink web UI and develop jobs. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/authentication_based_on_users_and_roles.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/authentication_based_on_users_and_roles.rst new file mode 100644 index 0000000..c1c46a3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/authentication_based_on_users_and_roles.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_24049.html + +.. 
_mrs_01_24049: + +Authentication Based on Users and Roles +======================================= + +This section describes how to create and configure a FlinkServer role on Manager as the system administrator. A FlinkServer role can be configured with FlinkServer administrator permission and the permissions to edit and view applications. + +You need to set permissions for the specified user in FlinkServer so that they can update, query, and delete data. + +Prerequisites +------------- + +The system administrator has planned permissions based on business needs. + +Procedure +--------- + +#. Log in to Manager. + +#. Choose **System** > **Permission** > **Role**. + +#. On the displayed page, click **Create Role** and specify **Role Name** and **Description**. + +#. Set **Configure Resource Permission**. + + FlinkServer permissions are as follows: + + - **FlinkServer Admin Privilege**: highest-level permission. Users with the permission can perform service operations on all FlinkServer applications. + - **FlinkServer Application**: Users can set **application view** and **applications management** permissions on applications. + + .. table:: **Table 1** Setting a role + + +------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ + | Scenario | Role Authorization | + +================================================+====================================================================================================================================+ + | Setting the administrator operation permission | In **Configure Resource Permission**, choose *Name of the desired cluster* > **Flink** and select **FlinkServer Admin Privilege**. | + +------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ + | Setting a specified permission on applications | a. In the **Configure Resource Permission** table, choose *Name of the desired cluster* > **Flink** > **FlinkServer Application**. | + | | b. In the **Permission** column, select **application view** or **applications management**. | + +------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**. Return to role management page. + + .. note:: + + After the FlinkServer role is created, create a FlinkServer user and bind the user to the role and user group. For details, see `Creating a User `__. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/index.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/index.rst new file mode 100644 index 0000000..cad4bc7 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24047.html + +.. _mrs_01_24047: + +FlinkServer Permissions Management +================================== + +- :ref:`Overview ` +- :ref:`Authentication Based on Users and Roles ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + overview + authentication_based_on_users_and_roles diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/overview.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/overview.rst new file mode 100644 index 0000000..4202d2a --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/flinkserver_permissions_management/overview.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_24048.html + +.. _mrs_01_24048: + +Overview +======== + +User **admin** of Manager does not have the FlinkServer service operation permission. To perform FlinkServer service operations, you need to grant related permission to the user. + +Applications (tenants) in FlinkServer are the maximum management scope, including cluster connection management, data connection management, application management, stream table management, and job management. + +There are three types of resource permissions for FlinkServer, as shown in :ref:`Table 1 `. + +.. _mrs_01_24048__table663518214115: + +.. table:: **Table 1** FlinkServer resource permissions + + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | Remarks | + +======================================+=========================================================================================================================================================================+====================================================================================================================================================================+ + | FlinkServer administrator permission | Users who have the permission can edit and view all applications. | This is the highest-level permission of FlinkServer. If you have the FlinkServer administrator permission, you have the permission on all applications by default. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Application edit permission | Users who have the permission can create, edit, and delete cluster connections and data connections. They can also create stream tables as well as create and run jobs. | In addition, users who have the permission can view current applications. | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Application view permission | Users who have the permission can view applications. 
| ``-`` | + +--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/index.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/index.rst new file mode 100644 index 0000000..746622e --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/index.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_24014.html + +.. _mrs_01_24014: + +Using the Flink Web UI +====================== + +- :ref:`Overview ` +- :ref:`FlinkServer Permissions Management ` +- :ref:`Accessing the Flink Web UI ` +- :ref:`Creating an Application on the Flink Web UI ` +- :ref:`Creating a Cluster Connection on the Flink Web UI ` +- :ref:`Creating a Data Connection on the Flink Web UI ` +- :ref:`Managing Tables on the Flink Web UI ` +- :ref:`Managing Jobs on the Flink Web UI ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + overview/index + flinkserver_permissions_management/index + accessing_the_flink_web_ui + creating_an_application_on_the_flink_web_ui + creating_a_cluster_connection_on_the_flink_web_ui + creating_a_data_connection_on_the_flink_web_ui + managing_tables_on_the_flink_web_ui + managing_jobs_on_the_flink_web_ui diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_jobs_on_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_jobs_on_the_flink_web_ui.rst new file mode 100644 index 0000000..011d5f6 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_jobs_on_the_flink_web_ui.rst @@ -0,0 +1,199 @@ +:original_name: mrs_01_24024.html + +.. _mrs_01_24024: + +Managing Jobs on the Flink Web UI +================================= + +Scenario +-------- + +Define Flink jobs, including Flink SQL and Flink JAR jobs. + +Creating a Job +-------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. + +#. Click **Job Management**. The job management page is displayed. + +#. Click **Create Job**. On the displayed job creation page, set parameters by referring to :ref:`Table 1 ` and click **OK**. The job development page is displayed. + + .. _mrs_01_24024__table25451917135812: + + .. table:: **Table 1** Parameters for creating a job + + +-------------+----------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +=============+================================================================================================================+ + | Type | Job type, which can be **Flink SQL** or **Flink Jar**. | + +-------------+----------------------------------------------------------------------------------------------------------------+ + | Name | Job name, which can contain a maximum of 64 characters. Only letters, digits, and underscores (_) are allowed. | + +-------------+----------------------------------------------------------------------------------------------------------------+ + | Task Type | Type of the job data source, which can be a stream job or a batch job. 
| + +-------------+----------------------------------------------------------------------------------------------------------------+ + | Description | Job description, which can contain a maximum of 100 characters. | + +-------------+----------------------------------------------------------------------------------------------------------------+ + +#. .. _mrs_01_24024__li3175133444316: + + (Optional) If you need to develop a job immediately, configure the job on the job development page. + + - Creating a Flink SQL job + + a. Develop the job on the job development page. + + b. Click **Check Semantic** to check the input content and click **Format SQL** to format SQL statements. + + c. After the job SQL statements are developed, set basic and customized parameters as required by referring to :ref:`Table 2 ` and click **Save**. + + .. _mrs_01_24024__table4292165617332: + + .. table:: **Table 2** Basic parameters + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===============================================================================================================================================================================================================+ + | Parallelism | Number of concurrent jobs. The value must be a positive integer containing a maximum of 64 characters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Maximum Operator Parallelism | Maximum parallelism of operators. The value must be a positive integer containing a maximum of 64 characters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | JobManager Memory (MB) | Memory of JobManager The minimum value is **512** and the value can contain a maximum of 64 characters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Submit Queue | Queue to which a job is submitted. If this parameter is not set, the **default** queue is used. The queue name can contain a maximum of 30 characters. Only letters, digits, and underscores (_) are allowed. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskManager | taskManager running parameters include: | + | | | + | | - **Slots**: If this parameter is left blank, the default value **1** is used. | + | | - **Memory (MB)**: The minimum value is **512**. 
| + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Enable CheckPoint | Whether to enable CheckPoint. After CheckPoint is enabled, you need to configure the following information: | + | | | + | | - **Time Interval (ms)**: This parameter is mandatory. | + | | | + | | - **Mode**: This parameter is mandatory. | + | | | + | | The options are **EXACTLY_ONCE** and **AT_LEAST_ONCE**. | + | | | + | | - **Minimum Interval (ms)**: The minimum value is **10**. | + | | | + | | - **Timeout Duration**: The minimum value is **10**. | + | | | + | | - **Maximum Parallelism**: The value must be a positive integer containing a maximum of 64 characters. | + | | | + | | - **Whether to clean up**: This parameter can be set to **Yes** or **No**. | + | | | + | | - **Whether to enable incremental checkpoints**: This parameter can be set to **Yes** or **No**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Failure Recovery Policy | Failure recovery policy of a job. The options are as follows: | + | | | + | | - **fixed-delay**: You need to configure **Retry Times** and **Retry Interval (s)**. | + | | - **failure-rate**: You need to configure **Max Retry Times**, **Interval (min)**, and **Retry Interval (s)**. | + | | - **none** | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. Click **Submit** in the upper left corner to submit the job. + + - Creating a Flink JAR job + + a. Click **Select** to upload a local JAR file and set parameters by referring to :ref:`Table 3 ` or add customized parameters. + + .. _mrs_01_24024__table1388311381402: + + .. table:: **Table 3** Parameter configuration + + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +===================================+===============================================================================================================================================================================================================+ + | Local .jar File | Upload a local JAR file. The size of the file cannot exceed 10 MB. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Main Class | Main-Class type. | + | | | + | | - **Default**: By default, the class name is specified based on the **Mainfest** file in the JAR file. | + | | - **Specify**: Manually specify the class name. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Class name. 
| + | | | + | | This parameter is available when **Main Class** is set to **Specify**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Class Parameter | Class parameters of Main-Class (parameters are separated by spaces). | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parallelism | Number of concurrent jobs. The value must be a positive integer containing a maximum of 64 characters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | JobManager Memory (MB) | Memory of JobManager The minimum value is **512** and the value can contain a maximum of 64 characters. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Submit Queue | Queue to which a job is submitted. If this parameter is not set, the **default** queue is used. The queue name can contain a maximum of 30 characters. Only letters, digits, and underscores (_) are allowed. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | taskManager | taskManager running parameters include: | + | | | + | | - **Slots**: If this parameter is left blank, the default value **1** is used. | + | | - **Memory (MB)**: The minimum value is **512**. | + +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + b. Click **Save** to save the configuration and click **Submit** to submit the job. + +#. Return to the job management page. You can view information about the created job, including job name, type, status, kind, and description. + + .. note:: + + To read files related to the submitted job on the node as another user, ensure that the user and the user who submitted the job belong to the same user group and the user has been assigned the FlinkServer application management role. For example,\ **application view** is selected by referring to :ref:`Authentication Based on Users and Roles `. + +Starting a Job +-------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the **Operation** column of the job to be started, click **Start** to run the job. Jobs in the **Draft**, **Saved**, **Submission failed**, **Running succeeded**, **Running failed**, or **Stop** state can be started. + +Developing a Job +---------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. 
The job management page is displayed. +#. In the **Operation** column of the job to be developed, click **Develop** to go to the job development page. Develop a job by referring to :ref:`4 `. You can view created stream tables and fields in the list on the left. + +Editing the Job Name and Description +------------------------------------ + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the **Operation** column of the item to be modified, click **Edit**, modify **Description**, and click **OK** to save the modification. + +Viewing Job Details +------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the **Operation** column of the item to be viewed, choose **More** > **Job Monitoring** to view the job running details. + + .. note:: + + You can only view details about jobs in the **Running** state. + +Checkpoint Failure Recovery +--------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the Operation column of the item to be restored, click **More** > **Checkpoint Failure Recovery**. You can perform checkpoint failure recovery for jobs in the **Running failed**, **Running Succeeded**, or **Stop** state. + +Filtering/Searching for Jobs +---------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the upper right corner of the page, you can obtain job information by selecting the job name, or enter a keyword to search for a job. + +Stopping a Job +-------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the **Operation** column of the item to be stopped, click **Stop**. Jobs in the **Submitting**, **Submission succeeded**, or **Running** state can be stopped. + +Deleting a Job +-------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Job Management**. The job management page is displayed. +#. In the **Operation** column of the item to be deleted, click **Delete**, and click **OK** in the displayed page. Jobs in the **Draft**, **Saved**, **Submission failed**, **Running succeeded**, **Running failed**, or **Stop** state can be deleted. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_tables_on_the_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_tables_on_the_flink_web_ui.rst new file mode 100644 index 0000000..a96c2d3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/managing_tables_on_the_flink_web_ui.rst @@ -0,0 +1,99 @@ +:original_name: mrs_01_24023.html + +.. _mrs_01_24023: + +Managing Tables on the Flink Web UI +=================================== + +Scenario +-------- + +Data tables can be used to define basic attributes and parameters of source tables, dimension tables, and output tables. + +Creating a Stream Table +----------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. + +#. Click **Table Management**. The table management page is displayed. + +#. 
Click **Create Stream Table**. On the stream table creation page, set parameters by referring to :ref:`Table 1 ` and click **OK**. + + .. _mrs_01_24023__table205858588169: + + .. table:: **Table 1** Parameters for creating a stream table + + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Remarks | + +========================+====================================================================================================================================================================================================================================+=====================================================================================================================================================+ + | Stream/Table Name | Stream/Table name, which can contain 1 to 64 characters. Only letters, digits, and underscores (_) are allowed. | Example: **flink_sink** | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Description | Stream/Table description information, which can contain 1 to 1024 characters. | ``-`` | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Mapping Table Type | Flink SQL does not provide the data storage function. Table creation is actually the creation of mapping for external data tables or storage. | ``-`` | + | | | | + | | The value can be **Kafka** or **HDFS**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Includes data source table **Source** and data result table **Sink**. Tables included in different mapping table types are as follows: | ``-`` | + | | | | + | | - Kafka: **Source** and **Sink** | | + | | - HDFS: **Source** and **Sink** | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Data Connection | Name of the data connection. 
| ``-`` | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Topic | Kafka topic to be read. Multiple Kafka topics can be read. Use separators to separate topics. | ``-`` | + | | | | + | | This parameter is available when **Mapping Table Type** is set to **Kafka**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | File Path | HDFS directory or a single file path to be transferred. | Example: | + | | | | + | | This parameter is available when **Mapping Table Type** is set to **HDFS**. | **/user/sqoop/** or **/user/sqoop/example.csv** | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Code | Codes corresponding to different mapping table types are as follows: | ``-`` | + | | | | + | | - Kafka: **CSV** and **JSON** | | + | | - HDFS: **CSV** | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Prefix | When **Mapping Table Type** is set to **Kafka**, **Type** is set to **Source**, and **Code** is set to **JSON**, this parameter indicates the hierarchical prefixes of multi-layer nested JSON, which are separated by commas (,). | For example, **data,info** indicates that the content under **data** and **info** in the nested JSON file is used as the data input in JSON format. | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Separator | Has different meanings when **Mapping Table Type** is set to the following values: It is used as the separator of specified CSV fields. This parameter is available only when **Code** is set to **CSV**. 
| Example: comma (**,**) | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Row Separator | Line break in the file, including **\\r**, **\\n**, and **\\r\\n**. | ``-`` | + | | | | + | | This parameter is available when **Mapping Table Type** is set to **HDFS**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Column Separator | Field separator in the file. | Example: comma (**,**) | + | | | | + | | This parameter is available when **Mapping Table Type** is set to **HDFS**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Stream Table Structure | Stream/Table structure, including **Name** and **Type**. | ``-`` | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Proctime | System time, which is irrelevant to the data timestamp. That is, the time when the calculation is complete in Flink operators. | ``-`` | + | | | | + | | This parameter is available when **Type** is set to **Source**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + | Event Time | Time when an event is generated, that is, the timestamp generated during data generation. | ``-`` | + | | | | + | | This parameter is available when **Type** is set to **Source**. | | + +------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+ + +Editing a Stream Table +---------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Table Management**. 
The table management page is displayed. +#. In the **Operation** column of the item to be modified, click **Edit**. On the displayed page, modify the stream table information by referring to :ref:`Table 1 ` and click **OK**. + +Searching for a stream table +---------------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Table Management**. The table management page is displayed. +#. In the upper right corner of the page, you can enter a keyword to search for stream table information. + +Deleting a Stream Table +----------------------- + +#. Access the Flink web UI. For details, see :ref:`Accessing the Flink Web UI `. +#. Click **Table Management**. The table management page is displayed. +#. In the **Operation** column of the item to be deleted, click **Delete**, and click **OK** in the displayed page. diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/flink_web_ui_application_process.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/flink_web_ui_application_process.rst new file mode 100644 index 0000000..7d14b19 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/flink_web_ui_application_process.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_24017.html + +.. _mrs_01_24017: + +Flink Web UI Application Process +================================ + +The Flink web UI application process is shown as follows: + + +.. figure:: /_static/images/en-us_image_0000001348770225.png + :alt: **Figure 1** Application process + + **Figure 1** Application process + +.. table:: **Table 1** Description of the Flink web UI application process + + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Phase | Description | Reference Section | + +===========================================+==========================================================================================================================+=========================================================================+ + | Creating an application | Applications can be used to isolate different upper-layer services. | :ref:`Creating an Application on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Creating a cluster connection | Different clusters can be accessed by configuring the cluster connection. | :ref:`Creating a Cluster Connection on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Creating a data connection | Through data connections, you can access different data services, including HDFS and Kafka. 
| :ref:`Creating a Data Connection on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Creating a stream table | Data tables can be used to define basic attributes and parameters of source tables, dimension tables, and output tables. | :ref:`Managing Tables on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Creating a SQL/JAR job (stream/batch job) | APIs can be used to define Flink jobs, including Flink SQL and Flink Jar jobs. | :ref:`Managing Jobs on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Managing a job | A created job can be managed, including starting, developing, stopping, deleting, and editing the job. | :ref:`Managing Jobs on the Flink Web UI ` | + +-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/index.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/index.rst new file mode 100644 index 0000000..4766f91 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24015.html + +.. _mrs_01_24015: + +Overview +======== + +- :ref:`Introduction to Flink Web UI ` +- :ref:`Flink Web UI Application Process ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + introduction_to_flink_web_ui + flink_web_ui_application_process diff --git a/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/introduction_to_flink_web_ui.rst b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/introduction_to_flink_web_ui.rst new file mode 100644 index 0000000..b035791 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/using_the_flink_web_ui/overview/introduction_to_flink_web_ui.rst @@ -0,0 +1,65 @@ +:original_name: mrs_01_24016.html + +.. _mrs_01_24016: + +Introduction to Flink Web UI +============================ + +Flink web UI provides a web-based visual development platform. You only need to compile SQL statements to develop jobs, slashing the job development threshold. In addition, the exposure of platform capabilities allows service personnel to compile SQL statements for job development to quickly respond to requirements, greatly reducing the Flink job development workload. + +.. note:: + + This section applies to only MRS 3.1.0 or later. + +Flink Web UI Features +--------------------- + +The Flink web UI has the following features: + +- Enterprise-class visual O&M: GUI-based O&M management, job monitoring, and standardization of Flink SQL statements for job development. 
+- Quick cluster connection: After configuring the client and user credential key file, you can quickly access a cluster using the cluster connection function. +- Quick data connection: You can access a component by configuring the data connection function. If **Data Connection Type** is set to **HDFS**, you need to create a cluster connection. If **Authentication Mode** is set to **KERBEROS** for other data connection types, you need to create a cluster connection. If **Authentication Mode** is set to **SIMPLE**, you do not need to create a cluster connection. + + .. note:: + + If **Data Connection Type** is set to **Kafka**, **Authentication Type** cannot be set to **KERBEROS**. + +- Visual development platform: The input/output mapping table can be customized to meet the requirements of different input sources and output destinations. +- Easy to use GUI-based job management + +Key Web UI Capabilities +----------------------- + +:ref:`Table 1 ` shows the key capabilities provided by Flink web UI. + +.. _mrs_01_24016__table91592142421: + +.. table:: **Table 1** Key web UI capabilities + + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Item | Description | + +====================================+===========================================================================================================================================================================================================================================+ + | Batch-Stream convergence | - Batch jobs and stream jobs can be processed with a unified set of Flink SQL statements. | + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flink SQL kernel capabilities | - Flink SQL supports customized window size, stream compute within 24 hours, and batch processing beyond 24 hours. | + | | - Flink SQL supports reading data from Kafka and HDFS, writing data to Kafka and HDFS. | + | | - A job can define multiple Flink SQL jobs, and multiple metrics can be combined into one job for computing. If a job contains same primary keys as well as same inputs and outputs, the job supports the computing of multiple windows. | + | | - The AVG, SUM, COUNT, MAX, and MIN statistical methods are supported. | + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flink SQL functions on the console | - Cluster connection management allows you to configure clusters where services such as Kafka and HDFS reside. | + | | - Data connection management allows you to configure services such as Kafka and HDFS. | + | | - Data table management allows you to define data tables accessed by SQL statements and generate DDL statements. | + | | - Flink SQL job definition allows you to verify, parse, optimize, convert a job into a Flink job, and submit the job for running based on the entered SQL statements. 
| + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Flink job visual management | - Stream jobs and batch jobs can be defined in a visual manner. | + | | - Job resources, fault recovery policies, and checkpoint policies can be configured in a visual manner. | + | | - Status monitoring of stream and batch jobs are supported. | + | | - The Flink job O&M is enhanced, including redirection of the native monitoring page. | + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Performance and reliability | - Stream processing supports 24-hour window aggregation computing and millisecond-level performance. | + | | - Batch processing supports 90-day window aggregation computing, which can be completed in minutes. | + | | - Invalid data of stream processing and batch processing can be filtered out. | + | | - When HDFS data is read, the data can be filtered based on the calculation period in advance. | + | | - If the job definition platform is faulty or the service is degraded, jobs cannot be redefined, but the computing of existing jobs is not affected. | + | | - The automatic restart mechanism is provided for job failures. You can configure restart policies. | + +------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flink/viewing_flink_job_information.rst b/doc/component-operation-guide/source/using_flink/viewing_flink_job_information.rst new file mode 100644 index 0000000..6d69372 --- /dev/null +++ b/doc/component-operation-guide/source/using_flink/viewing_flink_job_information.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_0784.html + +.. _mrs_01_0784: + +Viewing Flink Job Information +============================= + +You can view Flink job information on the Yarn web UI. + +Prerequisites +------------- + +The Flink service has been installed in a cluster. + +Accessing the Yarn Web UI +------------------------- + +#. Go to the Yarn service page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services** > **Yarn** > **Yarn Summary**. + - For MRS 1.9.2 or later, click the cluster name on the MRS console and choose **Components** > **Yarn** > **Yarn Summary**. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn** > **Instance** > **Dashboard**. + +#. Click the link next to **ResourceManager WebUI** to go to the Yarn web UI page. 
diff --git a/doc/component-operation-guide/source/using_flume/common_issues_about_flume.rst b/doc/component-operation-guide/source/using_flume/common_issues_about_flume.rst
new file mode 100644
index 0000000..d383761
--- /dev/null
+++ b/doc/component-operation-guide/source/using_flume/common_issues_about_flume.rst
@@ -0,0 +1,70 @@
+:original_name: mrs_01_1598.html
+
+.. _mrs_01_1598:
+
+Common Issues About Flume
+=========================
+
+Flume logs are stored in **/var/log/Bigdata/flume/flume/flumeServer.log**. Most data transmission exceptions and failures are recorded in this log. You can view it by running the following command:
+
+**tailf /var/log/Bigdata/flume/flume/flumeServer.log**
+
+- Problem: After the configuration file is uploaded, an exception occurs. After the configuration file is uploaded again, the scenario requirements are still not met, but no exception is recorded in the log.
+
+  Solution: Restart the Flume process by running **kill -9** *Process ID* to kill it, and then view the logs.
+
+- Problem: **"java.lang.IllegalArgumentException: Keytab is not a readable file: /opt/test/conf/user.keytab"** is displayed when HDFS is connected.
+
+  Solution: Grant the Flume running user read and write permissions on the keytab file.
+
+- Problem: The following error is reported when the Flume client is connected to Kafka:
+
+  .. code-block::
+
+     Caused by: java.io.IOException: /opt/FlumeClient/fusioninsight-flume-1.9.0/cof//jaas.conf (No such file or directory)
+
+  Solution: Add the **jaas.conf** configuration file and save it to the **conf** directory of the Flume client.
+
+  **vi jaas.conf**
+
+  .. code-block::
+
+     KafkaClient {
+     com.sun.security.auth.module.Krb5LoginModule required
+     useKeyTab=true
+     keyTab="/opt/test/conf/user.keytab"
+     principal="flume_hdfs@"
+     useTicketCache=false
+     storeKey=true
+     debug=true;
+     };
+
+  Values of **keyTab** and **principal** vary depending on the actual situation.
+
+- Problem: The following error is reported when the Flume client is connected to HBase:
+
+  .. code-block::
+
+     Caused by: java.io.IOException: /opt/FlumeClient/fusioninsight-flume-1.9.0/cof//jaas.conf (No such file or directory)
+
+  Solution: Add the **jaas.conf** configuration file and save it to the **conf** directory of the Flume client.
+
+  **vi jaas.conf**
+
+  .. code-block::
+
+     Client {
+     com.sun.security.auth.module.Krb5LoginModule required
+     useKeyTab=true
+     keyTab="/opt/test/conf/user.keytab"
+     principal="flume_hbase@"
+     useTicketCache=false
+     storeKey=true
+     debug=true;
+     };
+
+  Values of **keyTab** and **principal** vary depending on the actual situation.
+
+- Problem: After the configuration file is submitted, the Flume Agent occupies resources. How do I restore the Flume Agent to the state before the configuration file was uploaded?
+
+  Solution: Submit an empty **properties.properties** file.
diff --git a/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/index.rst b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/index.rst
new file mode 100644
index 0000000..39976b5
--- /dev/null
+++ b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/index.rst
@@ -0,0 +1,16 @@
+:original_name: mrs_01_1073.html
+
+.. _mrs_01_1073:
+
+Configuring the Flume Service Model
+===================================
+
+- :ref:`Overview `
+- :ref:`Service Model Configuration Guide `
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   overview
+   service_model_configuration_guide
diff --git a/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/overview.rst b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/overview.rst
new file mode 100644
index 0000000..6bdf542
--- /dev/null
+++ b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/overview.rst
@@ -0,0 +1,12 @@
+:original_name: mrs_01_1074.html
+
+.. _mrs_01_1074:
+
+Overview
+========
+
+This section applies to MRS 3.\ *x* or later.
+
+This section compares the performance of common Flume modules to help you plan a proper Flume service configuration and avoid overall performance degradation caused by a mismatch between a front-end source and a back-end sink.
+
+Only single-channel scenarios are compared in the description.
diff --git a/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/service_model_configuration_guide.rst b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/service_model_configuration_guide.rst
new file mode 100644
index 0000000..f834bb1
--- /dev/null
+++ b/doc/component-operation-guide/source/using_flume/configuring_the_flume_service_model/service_model_configuration_guide.rst
@@ -0,0 +1,250 @@
+:original_name: mrs_01_1075.html
+
+.. _mrs_01_1075:
+
+Service Model Configuration Guide
+=================================
+
+This section applies to MRS 3.\ *x* or later.
+
+During Flume service configuration and module selection, the maximum throughput of a sink must be greater than the maximum throughput of its source. Otherwise, under extreme load, the source writes data to the channel faster than the sink can read it, the channel fills up, and performance is affected.
+
+Avro Source and Avro Sink are usually used in pairs to transfer data between multiple Flume Agents. Therefore, Avro Source and Avro Sink do not become a performance bottleneck in general scenarios.
+
+Inter-Module Performance
+------------------------
+
+A comparison of the maximum performance of the modules shows that Kafka Sink and HDFS Sink can meet the throughput requirements when the front end is SpoolDir Source. However, HBase Sink can become a performance bottleneck because of its low write performance, causing data to pile up in the channel. If you have to use HBase Sink or other sinks that are prone to becoming performance bottlenecks, you can use **Channel Selector** or **Sink Group** to meet performance requirements.
+
+Channel Selector
+----------------
+
+A channel selector allows a source to connect to multiple channels. The data of the source can be distributed or copied by selecting different types of selectors. Flume currently provides two channel selectors: replicating and multiplexing.
+
+Replicating: The data of the source is synchronized to all channels.
+
+Multiplexing: A channel is selected based on the value of a specific field in the event header, so that data is distributed by service type.
+
+- Replicating configuration example:
+
+  ..
code-block:: + + client.sources = kafkasource + client.channels = channel1 channel2 + client.sources.kafkasource.type = org.apache.flume.source.kafka.KafkaSource + client.sources.kafkasource.kafka.topics = topic1,topic2 + client.sources.kafkasource.kafka.consumer.group.id = flume + client.sources.kafkasource.kafka.bootstrap.servers = 10.69.112.108:21007 + client.sources.kafkasource.kafka.security.protocol = SASL_PLAINTEXT + client.sources.kafkasource.batchDurationMillis = 1000 + client.sources.kafkasource.batchSize = 800 + client.sources.kafkasource.channels = channel1 c el2 + + client.sources.kafkasource.selector.type = replicating + client.sources.kafkasource.selector.optional = channel2 + + .. table:: **Table 1** Parameters in the Replicating configuration example + + +-------------------+---------------+-------------------------------------------------------+ + | Parameter | Default Value | Description | + +===================+===============+=======================================================+ + | Selector.type | replicating | Selector type. Set this parameter to **replicating**. | + +-------------------+---------------+-------------------------------------------------------+ + | Selector.optional | ``-`` | Optional channel. Configure this parameter as a list. | + +-------------------+---------------+-------------------------------------------------------+ + +- Multiplexing configuration example: + + .. code-block:: + + client.sources = kafkasource + client.channels = channel1 channel2 + client.sources.kafkasource.type = org.apache.flume.source.kafka.KafkaSource + client.sources.kafkasource.kafka.topics = topic1,topic2 + client.sources.kafkasource.kafka.consumer.group.id = flume + client.sources.kafkasource.kafka.bootstrap.servers = 10.69.112.108:21007 + client.sources.kafkasource.kafka.security.protocol = SASL_PLAINTEXT + client.sources.kafkasource.batchDurationMillis = 1000 + client.sources.kafkasource.batchSize = 800 + client.sources.kafkasource.channels = channel1 channel2 + + client.sources.kafkasource.selector.type = multiplexing + client.sources.kafkasource.selector.header = myheader + client.sources.kafkasource.selector.mapping.topic1 = channel1 + client.sources.kafkasource.selector.mapping.topic2 = channel2 + client.sources.kafkasource.selector.default = channel1 + + .. table:: **Table 2** Parameters in the Multiplexing configuration example + + +---------------------+-----------------------+--------------------------------------------------------+ + | Parameter | Default Value | Description | + +=====================+=======================+========================================================+ + | Selector.type | replicating | Selector type. Set this parameter to **multiplexing**. | + +---------------------+-----------------------+--------------------------------------------------------+ + | Selector.header | Flume.selector.header | ``-`` | + +---------------------+-----------------------+--------------------------------------------------------+ + | Selector.default | ``-`` | ``-`` | + +---------------------+-----------------------+--------------------------------------------------------+ + | Selector.mapping.\* | ``-`` | ``-`` | + +---------------------+-----------------------+--------------------------------------------------------+ + + In a multiplexing selector example, select a field whose name is topic from the header of the event. 
When the value of the topic field in the header is topic1, send the event to a channel 1; or when the value of the topic field in the header is topic2, send the event to a channel 2. + + Selectors need to use a specific header of an event in a source to select a channel, and need to select a proper header based on a service scenario to distribute data. + +SinkGroup +--------- + +When the performance of a backend single sink is insufficient, and high reliability or heterogeneous output is required, you can use a sink group to connect a specified channel to multiple sinks, thereby meeting use requirements. Currently, Flume provides two types of sink processors to manage sinks in a sink group. The types are load balancing and failover. + +Failover: Indicates that there is only one active sink in the sink group each time, and the other sinks are on standby and inactive. When the active sink becomes faulty, one of the inactive sinks is selected based on priorities to take over services, so as to ensure that data is not lost. This is used in high-reliability scenarios. + +Load balancing: Indicates that all sinks in the sink group are active. Each sink obtains data from the channel and processes the data. In addition, during running, loads of all sinks in the sink group are balanced. This is used in performance improvement scenarios. + +- Load balancing configuration examples: + + .. code-block:: + + client.sources = source1 + client.sinks = sink1 sink2 + client.channels = channel1 + + client.sinkgroups = g1 + client.sinkgroups.g1.sinks = sink1 sink2 + client.sinkgroups.g1.processor.type = load_balance + client.sinkgroups.g1.processor.backoff = true + client.sinkgroups.g1.processor.selector = random + + client.sinks.sink1.type = logger + client.sinks.sink1.channel = channel1 + + client.sinks.sink2.type = logger + client.sinks.sink2.channel = channel1 + + .. table:: **Table 3** Parameters of Load Balancing configuration examples + + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===============================+===============+==============================================================================================================================+ + | sinks | ``-`` | Specifies the sink list of the sink group. Multiple sinks are separated by spaces. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + | processor.type | default | Specifies the type of a processor. Set this parameter to **load_balance**. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + | processor.backoff | false | Indicates whether to back off failed sinks exponentially. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + | processor.selector | round_robin | Specifies the selection mechanism. It must be round_robin, random, or a customized class that inherits AbstractSinkSelector. 
| + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + | processor.selector.maxTimeOut | 30000 | Specifies the time for masking a faulty sink. The default value is 30,000 ms. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------+ + +- Failover configuration examples: + + .. code-block:: + + client.sources = source1 + client.sinks = sink1 sink2 + client.channels = channel1 + + client.sinkgroups = g1 + client.sinkgroups.g1.sinks = sink1 sink2 + client.sinkgroups.g1.processor.type = failover + client.sinkgroups.g1.processor.priority.sink1 = 10 + client.sinkgroups.g1.processor.priority.sink2 = 5 + client.sinkgroups.g1.processor.maxpenalty = 10000 + + client.sinks.sink1.type = logger + client.sinks.sink1.channel = channel1 + + client.sinks.sink2.type = logger + client.sinks.sink2.channel = channel1 + + .. table:: **Table 4** Parameters in the **failover** configuration example + + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===============================+===============+==========================================================================================================================================================================================================================================================================================+ + | sinks | ``-`` | Specifies the sink list of the sink group. Multiple sinks are separated by spaces. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | processor.type | default | Specifies the type of a processor. Set this parameter to **failover**. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | processor.priority. | ``-`` | Priority. **** must be defined in description of sinks. A sink having a higher priority is activated earlier. A larger value indicates a higher priority. **Note**: If there are multiple sinks, their priorities must be different. Otherwise, only one of them takes effect. | + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | processor.maxpenalty | 30000 | Specifies the maximum backoff time of failed sinks (unit: ms). 
| + +-------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Interceptors +------------ + +The Flume interceptor supports modification or discarding of basic unit events during data transmission. You can specify the class name list of built-in interceptors in Flume or develop customized interceptors to modify or discard events. The following table lists the built-in interceptors in Flume. A complex example is used in this section. Other users can configure and use interceptions as required. + +.. note:: + + 1. The interceptor is used between the sources and channels of Flume. Most sources provide parameters for configuring interceptors. You can set the parameters as required. + + 2. Flume allows multiple interceptors to be configured for a source. The interceptor names are separated by spaces. + + 3. The specified interceptor sequence is the order in which they are called. + + 4. The contents inserted by the interceptor in the header can be read and used in sink. + +.. table:: **Table 5** Types of built-in interceptors in Flume + + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Interceptor Type | Description | + +================================+====================================================================================================================================================================================================+ + | Timestamp Interceptor | The interceptor inserts a timestamp into the header of an event. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Host Interceptor | The interceptor inserts the IP address or host name of the node where the agent is located into the Header of an event. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Remove Header Interceptor | The interceptor discards the corresponding event based on the strings that matches the regular expression contained in the event header. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | UUID Interceptor | The interceptor generates a UUID string for the header of each event. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Search and Replace Interceptor | The interceptor provides a simple string-based search and replacement function based on Java regular expressions. The rule is the same as that of Java Matcher.replaceAll(). 
| + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Regex Filtering Interceptor | The interceptor uses the body of an event as a text file and matches the configured regular expression to filter events. The provided regular expression can be used to exclude or include events. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Regex Extractor Interceptor | The interceptor extracts content from the original events using a regular expression and adds the content to the header of events. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +**Regex Filtering Interceptor** is used as an example to describe how to use the interceptor. (For other types of interceptions, see the configuration provided on the official website.) + +.. table:: **Table 6** Parameter configuration for **Regex Filtering Interceptor** + + +---------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===============+===============+===========================================================================================================================================================+ + | type | ``-`` | Specifies the component type name. The value must be **regex_filter**. | + +---------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | regex | ``-`` | Specifies the regular expression used to match events. | + +---------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + | excludeEvents | false | By default, the matched events are collected. If this parameter is set to **true**, the matched events are deleted and the unmatched events are retained. | + +---------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Configuration example (netcat tcp is used as the source, and logger is used as the sink). After configuring the preceding parameters, run the **telnet** *Host name or IP address* **44444** command on the host where the Linux operating system is run, and enter a string that complies with the regular expression and another does not comply with the regular expression. The log shows that only the matched string is transmitted. + +.. 
code-block:: + + #define the source, channel, sink + server.sources = r1 + + server.channels = c1 + server.sinks = k1 + + #config the source + server.sources.r1.type = netcat + server.sources.r1.bind = ${Host IP address} + server.sources.r1.port = 44444 + server.sources.r1.interceptors= i1 + server.sources.r1.interceptors.i1.type= regex_filter + server.sources.r1.interceptors.i1.regex= (flume)|(myflume) + server.sources.r1.interceptors.i1.excludeEvents= false + server.sources.r1.channels = c1 + + #config the channel + server.channels.c1.type = memory + server.channels.c1.capacity = 1000 + server.channels.c1.transactionCapacity = 100 + #config the sink + server.sinks.k1.type = logger + server.sinks.k1.channel = c1 diff --git a/doc/component-operation-guide/source/using_flume/connecting_flume_to_kafka_in_security_mode.rst b/doc/component-operation-guide/source/using_flume/connecting_flume_to_kafka_in_security_mode.rst new file mode 100644 index 0000000..244ea86 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/connecting_flume_to_kafka_in_security_mode.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_1071.html + +.. _mrs_01_1071: + +Connecting Flume to Kafka in Security Mode +========================================== + +Scenario +-------- + +This section describes how to connect to Kafka using the Flume client in security mode. + +This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +#. Create a **jaas.conf** file and save it to **${**\ *Flume client installation directory*\ **} /conf**. The content of the **jaas.conf** file is as follows: + + .. code-block:: + + KafkaClient { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + keyTab="/opt/test/conf/user.keytab" + principal="flume_hdfs@" + useTicketCache=false + storeKey=true + debug=true; + }; + + Set **keyTab** and **principal** based on site requirements. The configured **principal** must have certain kafka permissions. + +#. Configure services. Set the port number of **kafka.bootstrap.servers** to **21007**, and set **kafka.security.protocol** to **SASL_PLAINTEXT**. + +#. If the domain name of the cluster where Kafka is located is changed, change the value of *-Dkerberos.domain.name* in the **flume-env.sh** file in **$**\ {*Flume client installation directory*} **/conf/** based on the site requirements. + +#. Upload the configured **properties.properties** file to **$**\ {*Flume client installation directory*} **/conf**. diff --git a/doc/component-operation-guide/source/using_flume/connecting_flume_with_hive_in_security_mode.rst b/doc/component-operation-guide/source/using_flume/connecting_flume_with_hive_in_security_mode.rst new file mode 100644 index 0000000..925eeee --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/connecting_flume_with_hive_in_security_mode.rst @@ -0,0 +1,179 @@ +:original_name: mrs_01_1072.html + +.. _mrs_01_1072: + +Connecting Flume with Hive in Security Mode +=========================================== + +Scenario +-------- + +This section describes how to use Flume to connect to Hive (version 3.1.0) in the cluster. + +This section applies to MRS 3.\ *x* or later. + +Prerequisites +------------- + +Flume and Hive have been correctly installed in the cluster. The services are running properly, and no alarm is reported. + +Procedure +--------- + +#. 
Import the following JAR packages to the lib directory (client/server) of the Flume instance to be tested as user **omm**: + + - antlr-2.7.7.jar + - antlr-runtime-3.4.jar + - calcite-core-1.16.0.jar + - hadoop-mapreduce-client-core-3.1.1.jar + - hive-beeline-3.1.0.jar + - hive-cli-3.1.0.jar + - hive-common-3.1.0.jar + - hive-exec-3.1.0.jar + - hive-hcatalog-core-3.1.0.jar + - hive-hcatalog-pig-adapter-3.1.0.jar + - hive-hcatalog-server-extensions-3.1.0.jar + - hive-hcatalog-streaming-3.1.0.jar + - hive-metastore-3.1.0.jar + - hive-service-3.1.0.jar + - libfb303-0.9.3.jar + - hadoop-plugins-1.0.jar + + You can obtain the JAR package from the Hive installation directory and restart the Flume process to ensure that the JAR package is loaded to the running environment. + +#. Set Hive configuration items. + + On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations** > **HiveServer** > **Customization** > **hive.server.customized.configs**. + + Example configurations: + + +----------------------------------+------------------------------------------------+ + | Name | Value | + +==================================+================================================+ + | hive.support.concurrency | true | + +----------------------------------+------------------------------------------------+ + | hive.exec.dynamic.partition.mode | nonstrict | + +----------------------------------+------------------------------------------------+ + | hive.txn.manager | org.apache.hadoop.hive.ql.lockmgr.DbTxnManager | + +----------------------------------+------------------------------------------------+ + | hive.compactor.initiator.on | true | + +----------------------------------+------------------------------------------------+ + | hive.compactor.worker.threads | 1 | + +----------------------------------+------------------------------------------------+ + +#. Prepare the system user **flume_hive** who has the supergroup and Hive permissions, install the client, and create the required Hive table. + + Example: + + a. The cluster client has been correctly installed. For example, the installation directory is **/opt/client**. + + b. Run the following command to authenticate the user: + + **cd /opt/client** + + **source bigdata_env** + + **kinit flume_hive** + + c. Run the **beeline** command and run the following table creation statement: + + .. code-block:: + + create table flume_multi_type_part(id string, msg string) + partitioned by (country string, year_month string, day string) + clustered by (id) into 5 buckets + stored as orc TBLPROPERTIES('transactional'='true'); + + d. Run the **select \* from** *Table name*\ **;** command to query data in the table. + + In this case, the number of data records in the table is **0**. + +#. Prepare related configuration files. Assume that the client installation package is stored in **/opt/FusionInsight_Cluster_1_Services_ClientConfig**. + + a. Obtain the following files from the $\ *Client decompression directory*\ **/Hive/config** directory: + + - hivemetastore-site.xml + - hive-site.xml + + b. Obtain the following files from the **$**\ *Client decompression directory*\ **/HDFS/config** directory: + + core-site.xml + + c. Create a directory on the host where the Flume instance is started and save the prepared files to the created directory. + + Example: **/opt/hivesink-conf/hive-site.xml**. + + d. 
Copy all property configurations in the **hivemetastore-site.xml** file to the **hive-site.xml** file and ensure that the configurations are placed before the original configurations. + + Data is loaded in sequence in Hive. + + .. note:: + + Ensure that the Flume running user **omm** has the read and write permissions on the directory where the configuration file is stored. + +#. Observe the result. + + On the Hive client, run the **select \* from** *Table name*\ **;** command. Check whether the corresponding data has been written to the Hive table. + +Examples +-------- + +Flume configuration example (SpoolDir--Mem--Hive): + +.. code-block:: + + server.sources = spool_source + server.channels = mem_channel + server.sinks = Hive_Sink + + #config the source + server.sources.spool_source.type = spooldir + server.sources.spool_source.spoolDir = /tmp/testflume + server.sources.spool_source.montime = + server.sources.spool_source.fileSuffix =.COMPLETED + server.sources.spool_source.deletePolicy = never + server.sources.spool_source.trackerDir =.flumespool + server.sources.spool_source.ignorePattern = ^$ + server.sources.spool_source.batchSize = 20 + server.sources.spool_source.inputCharset =UTF-8 + server.sources.spool_source.selector.type = replicating + server.sources.spool_source.fileHeader = false + server.sources.spool_source.fileHeaderKey = file + server.sources.spool_source.basenameHeaderKey= basename + server.sources.spool_source.deserializer = LINE + server.sources.spool_source.deserializer.maxBatchLine= 1 + server.sources.spool_source.deserializer.maxLineLength= 2048 + server.sources.spool_source.channels = mem_channel + + #config the channel + server.channels.mem_channel.type = memory + server.channels.mem_channel.capacity =10000 + server.channels.mem_channel.transactionCapacity= 2000 + server.channels.mem_channel.channelfullcount= 10 + server.channels.mem_channel.keep-alive = 3 + server.channels.mem_channel.byteCapacity = + server.channels.mem_channel.byteCapacityBufferPercentage= 20 + + #config the sink + server.sinks.Hive_Sink.type = hive + server.sinks.Hive_Sink.channel = mem_channel + server.sinks.Hive_Sink.hive.metastore = thrift://${any MetaStore service IP address}:21088 + server.sinks.Hive_Sink.hive.hiveSite = /opt/hivesink-conf/hive-site.xml + server.sinks.Hive_Sink.hive.coreSite = /opt/hivesink-conf/core-site.xml + server.sinks.Hive_Sink.hive.metastoreSite = /opt/hivesink-conf/hivemeatastore-site.xml + server.sinks.Hive_Sink.hive.database = default + server.sinks.Hive_Sink.hive.table = flume_multi_type_part + server.sinks.Hive_Sink.hive.partition = Tag,%Y-%m,%d + server.sinks.Hive_Sink.hive.txnsPerBatchAsk= 100 + server.sinks.Hive_Sink.hive.autoCreatePartitions= true + server.sinks.Hive_Sink.useLocalTimeStamp = true + server.sinks.Hive_Sink.batchSize = 1000 + server.sinks.Hive_Sink.hive.kerberosPrincipal= super1 + server.sinks.Hive_Sink.hive.kerberosKeytab= /opt/mykeytab/user.keytab + server.sinks.Hive_Sink.round = true + server.sinks.Hive_Sink.roundValue = 10 + server.sinks.Hive_Sink.roundUnit = minute + server.sinks.Hive_Sink.serializer = DELIMITED + server.sinks.Hive_Sink.serializer.delimiter= ";" + server.sinks.Hive_Sink.serializer.serdeSeparator= ';' + server.sinks.Hive_Sink.serializer.fieldnames= id,msg diff --git a/doc/component-operation-guide/source/using_flume/encrypted_transmission/configuring_the_encrypted_transmission.rst b/doc/component-operation-guide/source/using_flume/encrypted_transmission/configuring_the_encrypted_transmission.rst new file mode 100644 
index 0000000..57c6153 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/encrypted_transmission/configuring_the_encrypted_transmission.rst @@ -0,0 +1,328 @@ +:original_name: mrs_01_1069.html + +.. _mrs_01_1069: + +Configuring the Encrypted Transmission +====================================== + +Scenario +-------- + +This section describes how to configure the server and client parameters of the Flume service (including the Flume and MonitorServer roles) after the cluster is installed to ensure proper running of the service. + +This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +The cluster and Flume service have been installed. + +Procedure +--------- + +#. Generate the certificate trust lists of the server and client of the Flume role respectively. + + a. Remotely log in to the node using ECM where the Flume server is to be installed as user **omm**. Go to the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** directory. + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + .. note:: + + The version 8.1.0.1 is used as an example. Replace it with the actual version number. + + b. Run the following command to generate and export the server and client certificates of the Flume role: + + **sh geneJKS.sh -f** *xxx* **-g** *xxx* + + The generated certificate is saved in the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf** path . + + - **flume_sChat.jks** is the certificate library of the Flume role server. **flume_sChat.crt** is the exported file of the **flume_sChat.jks** certificate. **-f** indicates the password of the certificate and certificate library. + - **flume_cChat.jks** is the certificate library of the Flume role client. **flume_cChat.crt** is the exported file of the **flume_cChat.jks** certificate. **-g** indicates the password of the certificate and certificate library. + - **flume_sChatt.jks** and **flume_cChatt.jks** are the SSL certificate trust lists of the Flume server and client, respectively. + + .. note:: + + All user-defined passwords involved in this section must meet the following requirements: + + - The password must contain at least four types of uppercase letters, lowercase letters, digits, and special characters. + - The password must contain 8 to 64 characters. + - It is recommended that the user-defined passwords be changed periodically (for example, every three months), and certificates and trust lists be generated again to ensure security. + +#. Configure the server parameters of the Flume role and upload the configuration file to the cluster. + + a. Remotely log in to any node where the Flume role is located as user **omm** using ECM. Run the following command to go to the ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + b. .. _mrs_01_1069__l9f81f0e892824e79a1414cd62cce07ba: + + Run the following command to generate and obtain Flume server keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. It is the password of the **flume_sChat.jks** certificate library. + + **./genPwFile.sh** + + **cat password.property** + + c. 
Use the Flume configuration tool on the FusionInsight Manager portal to configure the server parameters and generate the configuration file. + + #. Log in to FusionInsight Manager. Choose **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **server**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use Avro Source, File Channel, and HDFS Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by seeing :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If the server parameters of the Flume role have been configured, you can choose **Services** > **Flume** > **Instance** on FusionInsight Manager. Then select the corresponding Flume role instance and click the **Download** button behind the **flume.config.file** parameter on the **Instance Configurations** page to obtain the existing server parameter configuration file. Choose **Services** > **Flume** > **Import** to change the relevant configuration items of encrypted transmission after the file is imported. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + .. _mrs_01_1069__te7d3219190a74a0aba371689e6bdb84d: + + .. table:: **Table 1** Parameters to be modified of the Flume role server + + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=======================+===================================================================================================================+===================================================================================================================+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | keystore | Indicates the server certificate. | ${BIGDATA_HOME\ **}**/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_sChat.jks | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | keystore-password | Specifies the password of the key library, which is the password required to obtain the keystore information. | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`2.b `. 
| | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | truststore | Indicates the SSL certificate trust list of the server. | ${BIGDATA_HOME\ **}**/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_sChatt.jks | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + | truststore-password | Specifies the trust list password, which is the password required to obtain the truststore information. | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`2.b `. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+ + + d. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume**. On the displayed page, click the **Flume** role under **Role**. + + e. Select the Flume role of the node where the configuration file is to be uploaded, choose **Instance Configurations** > **Import** beside the **flume.config.file**, and select the **properties.properties** file. + + .. note:: + + - An independent server configuration file can be uploaded to each Flume instance. + - This step is required for updating the configuration file. Modifying the configuration file on the background is an improper operation because the modification will be overwritten after configuration synchronization. + + f. Click **Save**, and then click **OK**. Click **Finish**. + +#. Set the client parameters of the Flume role. + + a. Run the following commands to copy the generated client certificate (**flume_cChat.jks**) and client trust list (**flume_cChatt.jks**) to the client directory, for example, **/opt/flume-client/fusionInsight-flume-1.9.0/conf/**. (The Flume client must have been installed.) **10.196.26.1** is the service plane IP address of the node where the client resides. + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_cChat.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_cChatt.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + .. note:: + + When copying the client certificate, you need to enter the password of user **user** of the host (for example, **10.196.26.1**) where the client resides. + + b. Log in to the node where the Flume client is decompressed as user **user**. Run the following command to go to the client directory **opt/flume-client/fusionInsight-flume-1.9.0/bin**. + + **cd** **opt/flume-client/fusionInsight-flume-1.9.0/bin** + + c. .. _mrs_01_1069__l5265677717ab4dd5971a3b6a0d0be5f6: + + Run the following command to generate and obtain Flume client keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. 
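+      .. note::
+
+         Optionally, you can confirm which password protects the copied client certificate library before continuing. The check below is only an illustrative sketch using the standard JDK **keytool** and is not part of the official procedure; it prompts for the keystore password and lists the certificate entries and their aliases.
+
+         .. code-block:: bash
+
+            # Optional check: list the entries in the copied client certificate library.
+            # keytool prompts for the keystore password of flume_cChat.jks.
+            keytool -list -keystore /opt/flume-client/fusionInsight-flume-1.9.0/conf/flume_cChat.jks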
The password is the same as the password of the certificate whose alias is *flumechatclient* and the password of the *flume_cChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + .. note:: + + If the following error message is displayed, run the export **JAVA_HOME=\ JDK path** command. + + .. code-block:: + + JAVA_HOME is null in current user,please install the JDK and set the JAVA_HOME + + d. Run the **echo $SCC_PROFILE_DIR** command to check whether the **SCC_PROFILE_DIR** environment variable is empty. + + - If yes, run the **source .sccfile** command. + - If no, go to :ref:`3.e `. + + e. .. _mrs_01_1069__l1267a09eec45401986e9df78695f5d4c: + + Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use SpoolDir Source, File Channel, and Avro Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by seeing :ref:`Table 2 ` based on the actual environment. + + .. note:: + + - If the client parameters of the Flume role have been configured, you can obtain the existing client parameter configuration file from *client installation directory*\ **/fusioninsight-flume-1.9.0/conf/properties.properties** to ensure that the configuration is in concordance with the previous. Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool** > **Import**, import the file, and modify the configuration items related to encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + - A unique checkpoint directory needs to be configured for each File Channel. + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + .. _mrs_01_1069__t231a870090124a8e8556717e6a7db11c: + + .. table:: **Table 2** Parameters to be modified of the Flume role client + + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=======================+===================================================================================================================+===================================================================+ + | ssl | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | keystore | Specified the client certificate. 
| /opt/flume-client/fusionInsight-flume-1.9.0/conf/flume_cChat.jks | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | keystore-password | Specifies the password of the key library, which is the password required to obtain the keystore information. | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`3.c `. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | truststore | Indicates the SSL certificate trust list of the client. | /opt/flume-client/fusionInsight-flume-1.9.0/conf/flume_cChatt.jks | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | truststore-password | Specifies the trust list password, which is the password required to obtain the truststore information. | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`3.c `. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + + f. Upload the **properties.properties** file to **flume/conf/** under the installation directory of the Flume client. + +#. Generate the certificate and trust list of the server and client of the MonitorServer role respectively. + + a. Log in to the host using ECM with the MonitorServer role assigned as user **omm**. + + Go to the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** directory. + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + b. Run the following command to generate and export the server and client certificates of the MonitorServer role: + + **sh geneJKS.sh -m** *xxx* **-n xxx** + + The generated certificate is saved in the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf** path. Where: + + - **ms_sChat.jks** is the certificate library of the MonitorServer role server. **ms_sChat.crt** is the exported file of the **ms_sChat.jks** certificate. **-m** indicates the password of the certificate and certificate library. + - **ms_cChat.jks** is the certificate library of the MonitorServer role client. **ms_cChat.crt** is the exported file of the **ms_cChat.jks** certificate. **-n** indicates the password of the certificate and certificate library. + - **ms_sChatt.jks** and **ms_cChatt.jks** are the SSL certificate trust lists of the MonitorServer server and client, respectively. + +#. Set the server parameters of the MonitorServer role. + + a. .. _mrs_01_1069__l7cc74e0469cb45f4aba9974f2846c1e0: + + Run the following command to generate and obtain MonitorServer server keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. The password is the same as the password of the certificate whose alias is *mschatserver* and the password of the *ms_sChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + b. 
Run the following command to open the ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/service/application.properties file: Modify related parameters based on the description in :ref:`Table 3 `, save the modification, and exit. + + **vi ${BIGDATA_HOME}/FusionInsight_Porter\_**\ 8.1.0.1\ **/install/FusionInsight-Flume-1.9.0/flume/conf/service/application.properties** + + .. _mrs_01_1069__tc0d290285ae94086985870f879b563c2: + + .. table:: **Table 3** Parameters to be modified of the MonitorServer role server + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=====================================+===========================================================================================================================================================================+==========================================================================================================+ + | ssl_need_kspasswd_decrypt_key | Specifies whether to enable the user-defined key encryption and decryption function. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_enable | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_key_store | Set this parameter based on the specific storage location. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_sChat.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_trust_key_store | Set this parameter based on the specific storage location. 
| ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_sChatt.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_key_store_password | Indicates the client certificate password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the certificate). | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`5.a `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_trust_key_store_password | Specifies the trustkeystore password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the trust list). | ``-`` | + | | | | + | | Enter the value of password obtained in :ref:`5.a `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_need_client_auth | Indicates whether to enable the client authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + + c. Restart the MonitorServer instance. Choose **Services** > **Flume** > **Instance** > **MonitorServer**, select the MonitorServer instance, and choose **More** > **Restart Instance**. Enter the system administrator password and click **OK**. After the restart is complete, click **Finish**. + +#. Set the client parameters of the MonitorServer role. + + a. Run the following commands to copy the generated client certificate (**ms_cChat.jks**) and client trust list (**ms_cChatt.jks**) to the **/opt/flume-client/fusionInsight-flume-1.9.0/conf/** client directory. **10.196.26.1** is the service plane IP address of the node where the client resides. + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChat.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChatt.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + b. Log in to the node where the Flume client is located as **user**. Run the following command to go to the client directory **/opt/flume-client/fusionInsight-flume-1.9.0/bin**. 
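+      Before changing directories, you can optionally confirm that the certificate library and trust list copied in the previous substep are present on the client node. The following is only an illustrative check and not part of the official procedure:
+
+      .. code-block:: bash
+
+         # Optional check: confirm the MonitorServer client certificate library and trust list are in place.
+         ls -l /opt/flume-client/fusionInsight-flume-1.9.0/conf/ms_cChat.jks \
+               /opt/flume-client/fusionInsight-flume-1.9.0/conf/ms_cChatt.jks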
+ + **cd** **/opt/flume-client/fusionInsight-flume-1.9.0/bin** + + c. .. _mrs_01_1069__l252c5a768cc34fcca9cfaa5a90dfe8c0: + + Run the following command to generate and obtain MonitorServer client keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. The password is the same as the password of the certificate whose alias is *mschatclient* and the password of the *ms_cChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + d. Run the following command to open the **/opt/flume-client/fusionInsight-flume-1.9.0/conf/service/application.properties** file. (**/opt/flume-client/fusionInsight-flume-1.9.0** is the directory where the client software is installed.) Modify related parameters based on the description in :ref:`Table 4 `, save the modification, and exit. + + **vi** **/opt/flume-client/fusionInsight-flume-1.9.0/flume/conf/service/application.properties** + + .. _mrs_01_1069__tea1b721973a843b7891ab85f51d2f2e6: + + .. table:: **Table 4** Parameters to be modified of the MonitorServer role client + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=====================================+=====================================================================================================================================================================+==========================================================================================================+ + | ssl_need_kspasswd_decrypt_key | Indicates whether to enable the user-defined key encryption and decryption function. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_enable | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_key_store | Set this parameter based on the specific storage location. 
| ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChat.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_trust_key_store | Set this parameter based on the specific storage location. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChatt.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_key_store_password | Specifies the keystore password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the certificate). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`6.c `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_trust_key_store_password | Specifies the trustkeystore password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the trust list). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`6.c `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_need_client_auth | Indicates whether to enable the client authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flume/encrypted_transmission/index.rst b/doc/component-operation-guide/source/using_flume/encrypted_transmission/index.rst new file mode 100644 index 0000000..dff45c0 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/encrypted_transmission/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_1068.html + +.. _mrs_01_1068: + +Encrypted Transmission +====================== + +- :ref:`Configuring the Encrypted Transmission ` +- :ref:`Typical Scenario: Collecting Local Static Logs and Uploading Them to HDFS ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + configuring_the_encrypted_transmission + typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs diff --git a/doc/component-operation-guide/source/using_flume/encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst b/doc/component-operation-guide/source/using_flume/encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst new file mode 100644 index 0000000..b4e29ad --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst @@ -0,0 +1,383 @@ +:original_name: mrs_01_1070.html + +.. _mrs_01_1070: + +Typical Scenario: Collecting Local Static Logs and Uploading Them to HDFS +========================================================================= + +Scenario +-------- + +This section describes how to use Flume to collect static logs from a local host and save them to the **/flume/test** directory on HDFS. + +This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +- The cluster, HDFS and Flume services, and Flume client have been installed. +- User **flume_hdfs** has been created, and the HDFS directory and data used for log verification have been authorized to the user. + +Procedure +--------- + +#. Generate the certificate trust lists of the server and client of the Flume role respectively. + + a. Log in to the node where the Flume server is located as user **omm**. Go to the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** directory. + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + b. Run the following command to generate and export the server and client certificates of the Flume role: + + **sh geneJKS.sh -f** *Password* **-g** *Password* + + The generated certificate is saved in the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf** path . + + - **flume_sChat.jks** is the certificate library of the Flume role server. **flume_sChat.crt** is the exported file of the **flume_sChat.jks** certificate. **-f** indicates the password of the certificate and certificate library. + - **flume_cChat.jks** is the certificate library of the Flume role client. **flume_cChat.crt** is the exported file of the **flume_cChat.jks** certificate. **-g** indicates the password of the certificate and certificate library. + - **flume_sChatt.jks** and **flume_cChatt.jks** are the SSL certificate trust lists of the Flume server and client, respectively. + + .. note:: + + All user-defined passwords involved in this section must meet the following requirements: + + - Contain at least four types of the following: uppercase letters, lowercase letters, digits, and special characters. + - Contain at least eight characters and a maximum of 64 characters. + - It is recommended that the user-defined passwords be changed periodically (for example, every three months), and certificates and trust lists be generated again to ensure security. + +#. On FusionInsight Manager, choose **System > User** and choose **More > Download Authentication Credential** to download the Kerberos certificate file of user **flume_hdfs** and save it to the local host. +#. Configure the server parameters of the Flume role and upload the configuration file to the cluster. + + a. 
Log in to any node where the Flume role is located as user **omm**. Run the following command to go to the ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin directory: + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + b. .. _mrs_01_1070__lf43fc3e7d9364ddb9e475908dc382fc9: + + Run the following command to generate and obtain Flume server keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. It is the password of the **flume_sChat.jks** certificate library. + + **./genPwFile.sh** + + **cat password.property** + + c. Use the Flume configuration tool on the FusionInsight Manager portal to configure the server parameters and generate the configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **server**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use SpoolDir Source, File Channel, and HDFS Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by seeing :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If the server parameters of the Flume role have been configured, you can choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Instance** on FusionInsight Manager. Then select the corresponding Flume role instance and click the **Download** button behind the **flume.config.file** parameter on the **Instance Configurations** page to obtain the existing server parameter configuration file. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool** > **Import**, import the file, and modify the configuration items related to encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + - A unique checkpoint directory needs to be configured for each File Channel. + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + .. _mrs_01_1070__t90702710f4c74f1ea2a89064a9507879: + + .. 
table:: **Table 1** Parameters to be modified of the Flume role server + + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +========================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================+============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank. | test | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | Specifies the IP address to which Avro Source is bound. This parameter cannot be left blank. It must be configured as the IP address that the server configuration file will upload. | 192.168.108.11 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | Specifies the IP address to which Avro Source is bound. This parameter cannot be left blank. It must be configured as an unused port. 
| 21154 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | Indicates the server certificate. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_sChat.jks | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-password | Specifies the password of the key library, which is the password required to obtain the keystore information. | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`3.b `. 
| | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore | Indicates the SSL certificate trust list of the server. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_sChatt.jks | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-password | Specifies the trust list password, which is the password required to obtain the truststore information. | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`3.b `. | | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDirs | Specifies the directory for storing buffer data. The run directory is used by default. Configuring multiple directories on disks can improve transmission efficiency. Use commas (,) to separate multiple directories. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/data** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. 
| /srv/BigData/hadoop/data1/flumeserver/data | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointDir | Specifies the directory for storing the checkpoint information, which is under the run directory by default. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/checkpoint** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flumeserver/checkpoint | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | Specifies the transaction size, that is, the number of events in a transaction that can be processed by the current Channel. The size cannot be smaller than the batchSize of Source. Setting the same size as batchSize is recommended. | 61200 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | Specifies the HDFS data write directory. This parameter cannot be left blank. 
| hdfs://hacluster/flume/test | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.inUsePrefix | Specifies the prefix of the file that is being written to HDFS. | TMP\_ | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | Specifies the maximum number of events that can be written to HDFS once. | 61200 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. | flume_hdfs | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | Specifies the keytab file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. 
note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hdfs**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | Specifies whether to use the local time. Possible values are **true** and **false**. | true | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume**. On the displayed page, click the **Flume** role under **Role**. + + e. Select the Flume role of the node where the configuration file is to be uploaded, choose **Instance Configurations** > **Import** beside the **flume.config.file**, and select the **properties.properties** file. + + .. note:: + + - An independent server configuration file can be uploaded to each Flume instance. + - This step is required for updating the configuration file. Modifying the configuration file on the background is an improper operation because the modification will be overwritten after configuration synchronization. + + f. Click **Save**, and then click **OK**. + + g. Click **Finish**. + +#. Configure the client parameters of the Flume role. + + a. Run the following commands to copy the generated client certificate (**flume_cChat.jks**) and client trust list (**flume_cChatt.jks**) to the client directory, for example, **/opt/flume-client/fusionInsight-flume-1.9.0/conf/**. (The Flume client must have been installed.) **10.196.26.1** is the service plane IP address of the node where the client resides. + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_cChat.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/flume_cChatt.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + .. 
note:: + + When copying the client certificate, you need to enter the password of user **user** of the host (for example, **10.196.26.1**) where the client resides. + + b. Log in to the node where the Flume client is decompressed as user **user**. Run the following command to go to the client directory **/opt/flume-client/fusionInsight-flume-1.9.0/bin**. + + **cd** opt/flume-client/fusionInsight-flume-1.9.0/bin + + c. .. _mrs_01_1070__lf5cdb5eca44842caac47a27a09a4e206: + + Run the following command to generate and obtain Flume client keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. The password is the same as the password of the certificate whose alias is *flumechatclient* and the password of the *flume_cChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + .. note:: + + If the following error message is displayed, run the export **JAVA_HOME=\ JDKpath** command. + + .. code-block:: + + JAVA_HOME is null in current user,please install the JDK and set the JAVA_HOME + + d. Run the **echo $SCC_PROFILE_DIR** command to check whether the **SCC_PROFILE_DIR** environment variable is empty. + + - If yes, run the **source .sccfile** command. + - If no, go to :ref:`4.e `. + + e. .. _mrs_01_1070__l5e1fd98241304e658764a6bdc63d7299: + + Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + Use SpoolDir Source, File Channel, and HDFS Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by seeing :ref:`Table 2 ` based on the actual environment. + + .. note:: + + - If the client parameters of the Flume role have been configured, you can obtain the existing client parameter configuration file from *client installation directory*\ **/fusioninsight-flume-1.9.0/conf/properties.properties** to ensure that the configuration is in concordance with the previous. Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool** > **Import**, import the file, and modify the configuration items related to encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + .. _mrs_01_1070__t4e49dd595a71448eb33a418332772306: + + .. 
table:: **Table 2** Parameters to be modified of the Flume role client + + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=======================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================+===================================================================+ + | Name | The value must be unique and cannot be left blank. | test | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | spoolDir | Specifies the directory where the file to be collected resides. This parameter cannot be left blank. The directory needs to exist and have the write, read, and execute permissions on the flume running user. | /srv/BigData/hadoop/data1/zb | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | trackerDir | Specifies the path for storing the metadata of files collected by Flume. | /srv/BigData/hadoop/data1/tracker | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | batch-size | Specifies the number of events that Flume sends in a batch. 
| 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | dataDirs | Specifies the directory for storing buffer data. The run directory is used by default. Configuring multiple directories on disks can improve transmission efficiency. Use commas (,) to separate multiple directories. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/data** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/data | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | checkpointDir | Specifies the directory for storing the checkpoint information, which is under the run directory by default. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/checkpoint** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/checkpoint | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | transactionCapacity | Specifies the transaction size, that is, the number of events in a transaction that can be processed by the current Channel. The size cannot be smaller than the batchSize of Source. Setting the same size as batchSize is recommended. | 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | hostname | Specifies the name or IP address of the host whose data is to be sent. This parameter cannot be left blank. 
Name or IP address must be configured to be the name or IP address that the Avro source associated with it. | 192.168.108.11 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | port | Specifies the IP address to which Avro Sink is bound. This parameter cannot be left blank. It must be consistent with the port that is monitored by the connected Avro Source. | 21154 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | keystore | Specifies the **flume_cChat.jks** certificate generated on the server. | /opt/flume-client/fusionInsight-flume-1.9.0/conf/flume_cChat.jks | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | keystore-password | Specifies the password of the key library, which is the password required to obtain the keystore information. | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`4.c `. 
| | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | truststore | Indicates the SSL certificate trust list of the server. | /opt/flume-client/fusionInsight-flume-1.9.0/conf/flume_cChatt.jks | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + | truststore-password | Specifies the trust list password, which is the password required to obtain the truststore information. | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`4.c `. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------+ + + f. Upload the **properties.properties** file to **flume/conf/** under the installation directory of the Flume client. + +#. Generate the certificate and trust list of the server and client of the MonitorServer role respectively. + + a. Log in to the host with the MonitorServer role assigned as user **omm**. + + Go to the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** directory. + + **cd ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/bin** + + b. Run the following command to generate and export the server and client certificates of the MonitorServer role: + + **sh geneJKS.sh -m** *Password* **-n** *Password* + + The generated certificate is saved in the **${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf** path. Where: + + - **ms_sChat.jks** is the certificate library of the MonitorServer role server. **ms_sChat.crt** is the exported file of the **ms_sChat.jks** certificate. **-m** indicates the password of the certificate and certificate library. + - **ms_cChat.jks** is the certificate library of the MonitorServer role client. **ms_cChat.crt** is the exported file of the **ms_cChat.jks** certificate. **-n** indicates the password of the certificate and certificate library. + - **ms_sChatt.jks** and **ms_cChatt.jks** are the SSL certificate trust lists of the MonitorServer server and client, respectively. + +#. Set the server parameters of the MonitorServer role. + + a. .. 
_mrs_01_1070__la6ea6d1571ea4b2a94b3c942a18144db: + + Run the following command to generate and obtain MonitorServer server keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. The password is the same as the password of the certificate whose alias is *mschatserver* and the password of the *ms_sChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + b. Run the following command to open the ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/service/application.properties file: Modify related parameters based on the description in :ref:`Table 3 `, save the modification, and exit. + + **vi ${BIGDATA_HOME}/FusionInsight_Porter\_**\ 8.1.0.1\ **/install/FusionInsight-Flume-1.9.0/flume/conf/service/application.properties** + + .. _mrs_01_1070__tc32e0ef5ae504791afb953e98354efa7: + + .. table:: **Table 3** Parameters to be modified of the MonitorServer role server + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=====================================+===========================================================================================================================================================================+==========================================================================================================+ + | ssl_need_kspasswd_decrypt_key | Indicates whether to enable the user-defined key encryption and decryption function. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_enable | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_key_store | Set this parameter based on the specific storage location. 
| ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_sChat.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_trust_key_store | Set this parameter based on the specific storage location. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_sChatt.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_key_store_password | Indicates the client certificate password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the certificate). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`6.a `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_server_trust_key_store_password | Indicates the client trust list password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the trust list). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`6.a `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_need_client_auth | Indicates whether to enable the client authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + + c. Restart the MonitorServer instance. Choose **Cluster >** *Name of the desired cluster* **> Services > Flume > Instance > MonitorServer**, select the configured MonitorServer instance, and choose **More > Restart Instance**. Enter the system administrator password and click **OK**. After the restart is complete, click **Finish**. + +#. Set the client parameters of the MonitorServer role. + + a. Run the following commands to copy the generated client certificate (**ms_cChat.jks**) and client trust list (**ms_cChatt.jks**) to the **/opt/flume-client/fusionInsight-flume-1.9.0/conf/** client directory. **10.196.26.1** is the service plane IP address of the node where the client resides. 
+ + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChat.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + **scp ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChatt.jks user@10.196.26.1:/opt/flume-client/fusionInsight-flume-1.9.0/conf/** + + b. Log in to the node where the Flume client is located as user **user**. Run the following command to go to the client directory **/opt/flume-client/fusionInsight-flume-1.9.0/bin**. + + **cd** /opt/flume-client/fusionInsight-flume-1.9.0/bin + + c. .. _mrs_01_1070__l6c040d3a99c04a7d87c53e59bafe8394: + + Run the following command to generate and obtain MonitorServer client keystore password, trust list password, and keystore-password encrypted private key information. Enter the password twice and confirm the password. The password is the same as the password of the certificate whose alias is *mschatclient* and the password of the *ms_cChat.jks* certificate library. + + **./genPwFile.sh** + + **cat password.property** + + d. Run the following command to open the **/opt/flume-client/fusionInsight-flume-1.9.0/conf/service/application.properties** file. (**/opt/flume-client/fusionInsight-flume-1.9.0** is the directory where the client is installed.) Modify related parameters based on the description in :ref:`Table 4 `, save the modification, and exit. + + **vi** **/opt/flume-client/fusionInsight-flume-1.9.0/conf/service/application.properties** + + .. _mrs_01_1070__ta0130cca376a4aaf833fa310a2e59e9d: + + .. table:: **Table 4** Parameters to be modified of the MonitorServer role client + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=====================================+=====================================================================================================================================================================+==========================================================================================================+ + | ssl_need_kspasswd_decrypt_key | Indicates whether to enable the user-defined key encryption and decryption function. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_enable | Indicates whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. 
| | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_key_store | Set this parameter based on the specific storage location. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChat.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_trust_key_store | Set this parameter based on the specific storage location. | ${BIGDATA_HOME}/FusionInsight_Porter\_8.1.0.1/install/FusionInsight-Flume-1.9.0/flume/conf/ms_cChatt.jks | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_key_store_password | Specifies the keystore password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the certificate). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`7.c `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_client_trust_key_store_password | Specifies the trustkeystore password. Set this parameter based on the actual situation of certificate creation (the plaintext key used to generate the trust list). | ``-`` | + | | | | + | | Enter the value of **password** obtained in :ref:`7.c `. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + | ssl_need_client_auth | Indicates whether to enable the client authentication. (You are advised to enable this function to ensure security.) | true | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+ + +8. Verify log transmission. + + a. Log in to FusionInsight Manager as a user who has the management permission on HDFS. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. 
Choose **Cluster >** *Name of the desired cluster* > **Services** > **HDFS**, click the HDFS WebUI link to go to the HDFS WebUI, and choose **Utilities** > **Browse the file system**. + b. Check whether the data is generated in the **/flume/test** directory on the HDFS. diff --git a/doc/component-operation-guide/source/using_flume/flume_client_cgroup_usage_guide.rst b/doc/component-operation-guide/source/using_flume/flume_client_cgroup_usage_guide.rst new file mode 100644 index 0000000..741b122 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/flume_client_cgroup_usage_guide.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_1082.html + +.. _mrs_01_1082: + +Flume Client Cgroup Usage Guide +=============================== + +Scenario +-------- + +This section describes how to join and log out of a cgroup, query the cgroup status, and change the cgroup CPU threshold. + +This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +- **Join Cgroup** + + Assume that the Flume client installation path is **/opt/FlumeClient**, and the cgroup CPU threshold is 50%. Run the following command to join a cgroup: + + **cd /opt/FlumeClient/fusioninsight-flume-1.9.0/bin** + + **./flume-manage.sh cgroup join 50** + + .. note:: + + - This command can be used to join a cgroup and change the cgroup CPU threshold. + - The value of the CPU threshold of a cgroup ranges from 1 to 100 x *N*. *N* indicates the number of CPU cores. + +- **Check Cgroup status** + + Assume that the Flume client installation path is **/opt/FlumeClient**. Run the following commands to query the cgroup status: + + **cd /opt/FlumeClient/fusioninsight-flume-1.9.0/bin** + + **./flume-manage.sh cgroup status** + +- **Exit Cgroup** + + Assume that the Flume client installation path is **/opt/FlumeClient**. Run the following commands to exit cgroup: + + **cd /opt/FlumeClient/fusioninsight-flume-1.9.0/bin** + + **./flume-manage.sh cgroup exit** + + .. note:: + + - After the client is installed, the default cgroup is automatically created. If the **-s** parameter is not configured during client installation, the default value **-1** is used. The default value indicates that the agent process is not restricted by the CPU usage. + - Joining or exiting a cgroup does not affect the agent process. Even if the agent process is not started, the joining or exiting operation can be performed successfully, and the operation will take effect after the next startup of the agent process. + - After the client is uninstalled, the cgroups created during the client installation are automatically deleted. diff --git a/doc/component-operation-guide/source/using_flume/flume_configuration_parameter_description.rst b/doc/component-operation-guide/source/using_flume/flume_configuration_parameter_description.rst new file mode 100644 index 0000000..adf206d --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/flume_configuration_parameter_description.rst @@ -0,0 +1,512 @@ +:original_name: mrs_01_0396.html + +.. _mrs_01_0396: + +Flume Configuration Parameter Description +========================================= + +For versions earlier than MRS 3.x, configure Flume parameters in the **properties.properties** file. + +For MRS 3.x or later, some parameters can be configured on Manager. + +Overview +-------- + +This section describes how to configure the sources, channels, and sinks of Flume, and modify the configuration items of each module. 
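+For orientation, the following is a minimal, hypothetical **properties.properties** sketch showing how a source, a channel, and a sink are declared and wired together within one agent. The agent name (**client**) and the component names (**s1**, **c1**, and **k1**) are placeholders, and the example values are borrowed from elsewhere in this guide; the individual parameters of each module type are described in the tables below.
+
+.. code-block:: properties
+
+   # Declare one source, one channel, and one sink for the agent "client" (placeholder name).
+   client.sources = s1
+   client.channels = c1
+   client.sinks = k1
+
+   # SpoolDir source: collects files from a local directory and writes events to channel c1.
+   client.sources.s1.type = spooldir
+   client.sources.s1.spoolDir = /srv/BigData/hadoop/data1/zb
+   client.sources.s1.channels = c1
+
+   # File channel: buffers events on local disk between the source and the sink.
+   client.channels.c1.type = file
+   client.channels.c1.dataDirs = /srv/BigData/hadoop/data1/flume/data
+   client.channels.c1.checkpointDir = /srv/BigData/hadoop/data1/flume/checkpoint
+
+   # Avro sink: forwards events from channel c1 to the Avro source of a peer Flume agent.
+   client.sinks.k1.type = avro
+   client.sinks.k1.hostname = 192.168.108.11
+   client.sinks.k1.port = 21154
+   client.sinks.k1.channel = c1
+
+In this sketch, the **hostname** and **port** of the Avro sink must match the address and listening port of the Avro source it connects to.
+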
+ +For MRS 3.x or later, log in to FusionInsight Manager and choose **Cluster** > **Services** > **Flume**. On the displayed page, click the **Configuration Tool** tab, select and drag the source, channel, and sink to be used to the GUI on the right, and double-click them to configure corresponding parameters. Parameters such as **channels** and **type** are configured only in the client configuration file **properties.properties**, the path of which is *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume version*\ **/conf/properties.properties**. + +.. note:: + + You must input encrypted information for some configurations. For details on how to encrypt information, see :ref:`Using the Encryption Tool of the Flume Client `. + +Common Source Configurations +---------------------------- + +- **Avro Source** + + An Avro source listens to the Avro port, receives data from the external Avro client, and places data into configured channels. :ref:`Table 1 ` lists common configurations. + + .. _mrs_01_0396__tb4f2cc56cf3945f7bba26f90f1afaa79: + + .. table:: **Table 1** Common configurations of an Avro source + + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+=======================+=======================================================================================================================================================================================+ + | channels | **-** | Specifies the channel connected to the source. Multiple channels can be configured. Use spaces to separate them. | + | | | | + | | | In a single proxy process, sources and sinks are connected through channels. A source instance corresponds to multiple channels, but a sink instance corresponds only to one channel. | + | | | | + | | | The format is as follows: | + | | | | + | | | **.sources..channels = ...** | + | | | | + | | | **.sinks..channels = ** | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | avro | Specifies the type, which is set to **avro**. The type of each source is a fixed value. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | ``-`` | Specifies the host name or IP address associated with the source. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound port number. 
| + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. | + | | | | + | | | - true | + | | | - false | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-type | JKS | Specifies the Java trust store type. Set this parameter to **JKS** or other truststore types supported by Java. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore | ``-`` | Specifies the Java trust store file. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-password | ``-`` | Specifies the Java trust store password. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the key storage type. Set this parameter to **JKS** or other truststore types supported by Java. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the key storage file. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-password | ``-`` | Specifies the key storage password. | + +-----------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **SpoolDir Source** + + A SpoolDir source monitors and transmits new files that have been added to directories in quasi-real-time mode. Common configurations are as follows: + + .. table:: **Table 2** Common configurations of a SpoolDir source + + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +============================+=======================+==========================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. 
Multiple channels can be configured. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | spooldir | Type, which is set to **spooldir**. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spoolDir | ``-`` | Specifies the monitoring directory. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileSuffix | .COMPLETED | Specifies the suffix added after file transmission is complete. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deletePolicy | never | Specifies the source file deletion policy after file transmission is complete. The value can be either **never** or **immediate**. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ignorePattern | ^$ | Specifies the regular expression of a file to be ignored. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | trackerDir | .flumespool | Specifies the metadata storage path during data transmission. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the source transmission granularity. 
| + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | decodeErrorPolicy | FAIL | Specifies the code error policy. This parameter can be configured only in the **properties.properties** file. | + | | | | + | | | The value can be **FAIL**, **REPLACE**, or **IGNORE**. | + | | | | + | | | **FAIL**: Generate an exception and fail the parsing. | + | | | | + | | | **REPLACE**: Replace the characters that cannot be identified with other characters, such as U+FFFD. | + | | | | + | | | **IGNORE**: Discard character strings that cannot be parsed. | + | | | | + | | | .. note:: | + | | | | + | | | If a code error occurs in the file, set **decodeErrorPolicy** to **REPLACE** or **IGNORE**. Flume will skip the code error and continue to collect subsequent logs. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer | LINE | Specifies the file parser. The value can be either **LINE** or **BufferedLine**. | + | | | | + | | | - When the value is set to **LINE**, characters read from the file are transcoded one by one. | + | | | - When the value is set to **BufferedLine**, one line or multiple lines of characters read from the file are transcoded in batches, which delivers better performance. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer.maxLineLength | 2048 | Specifies the maximum length for resolution by line, ranging from 0 to 2,147,483,647. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer.maxBatchLine | 1 | Specifies the maximum number of lines for resolution by line. If multiple lines are set, **maxLineLength** must be set to a corresponding multiplier. For example, if **maxBatchLine** is set to **2**, **maxLineLength** is set to **4096** (2048 x 2). | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | selector.type | replicating | Specifies the selector type. The value can be either **replicating** or **multiplexing**. | + | | | | + | | | - **replicating** indicates that the same content is sent to each channel. | + | | | - **multiplexing** indicates that the content is sent only to certain channels according to the distribution rule. 
| + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | interceptors | ``-`` | Specifies the interceptor. For details, see the `Flume official document `__. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +----------------------------+-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + The Spooling source ignores the last line feed character of each event when data is read by line. Therefore, Flume does not calculate the data volume counters used by the last line feed character. + +- **Kafka Source** + + A Kafka source consumes data from Kafka topics. Multiple sources can consume data of the same topic, and the sources consume different partitions of the topic. Common configurations are as follows: + + .. table:: **Table 3** Common configurations of a Kafka source + + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=================================+===========================================+====================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | org.apache.flume.source.kafka.KafkaSource | Specifies the type, which is set to **org.apache.flume.source.kafka.KafkaSource**. | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | nodatatime | 0 (Disabled) | Specifies the alarm threshold. An alarm is triggered when the duration that Kafka does not release data to subscribers exceeds the threshold. 
Unit: second | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written into a channel at a time. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchDurationMillis | 1000 | Specifies the maximum duration of topic data consumption at a time, expressed in milliseconds. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keepTopicInHeader | false | Specifies whether to save topics in the event header. If topics are saved, topics configured in Kafka sinks become invalid. | + | | | | + | | | - true | + | | | - false | + | | | | + | | | This parameter can be configured only in the **properties.properties** file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keepPartitionInHeader | false | Specifies whether to save partition IDs in the event header. If partition IDs are saved, Kafka sinks write data to the corresponding partitions. | + | | | | + | | | - true | + | | | - false | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | ``-`` | Specifies the list of Broker addresses, which are separated by commas. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id | ``-`` | Specifies the Kafka consumer group ID. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics | ``-`` | Specifies the list of subscribed Kafka topics, which are separated by commas (,). | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics.regex | ``-`` | Specifies the subscribed topics that comply with regular expressions. **kafka.topics.regex** has a higher priority than **kafka.topics** and will overwrite **kafka.topics**. 
| + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.security.protocol | SASL_PLAINTEXT | Specifies the security protocol of Kafka. The value must be set to **PLAINTEXT** for clusters in which Kerberos authentication is disabled. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.kerberos.domain.name | ``-`` | Specifies the value of **default_realm** of Kerberos in the Kafka cluster, which should be configured only for security clusters. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Other Kafka Consumer Properties | ``-`` | Specifies other Kafka configurations. This parameter can be set to any consumption configuration supported by Kafka, and the **.kafka** prefix must be added to the configuration. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +---------------------------------+-------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Taildir Source** + + A Taildir source monitors file changes in a directory and automatically reads the file content. In addition, it can transmit data in real time. :ref:`Table 4 ` lists common configurations. + + .. _mrs_01_0396__t2c85090722c4451682fad2657a7bdc35: + + .. table:: **Table 4** Common configurations of a Taildir source + + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +========================================+=======================+==============================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | taildir | Specifies the type, which is set to **taildir**. | + | | | | + | | | This parameter can be set only in the properties.properties file. 
| + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups | ``-`` | Specifies the group name of a collection file directory. Group names are separated by spaces. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups..parentDir | ``-`` | Specifies the parent directory. The value must be an absolute path. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups..filePattern | ``-`` | Specifies the relative file path of the file group's parent directory. Directories can be included and regular expressions are supported. It must be used together with **parentDir**. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | positionFile | ``-`` | Specifies the metadata storage path during data transmission. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | headers.. | ``-`` | Specifies the key-value of an event when data of a group is being collected. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | byteOffsetHeader | false | Specifies whether each event header should contain the location information about the event in the source file. The location information is saved in the **byteoffset** variable. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | skipToEnd | false | Specifies whether Flume can locate the latest location of a file and read the latest data after restart. 
| + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | idleTimeout | 120000 | Specifies the idle duration during file reading, expressed in milliseconds. If the file data is not changed in this idle period, the source closes the file. If data is written into this file after it is closed, the source opens the file and reads data. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | writePosInterval | 3000 | Specifies the interval for writing metadata to a file, expressed in milliseconds. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written to the channel in batches. | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +----------------------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Http Source** + + An HTTP source receives data from an external HTTP client and sends the data to the configured channels. :ref:`Table 5 ` lists common configurations. + + .. _mrs_01_0396__t033eef1276424185b1cfd10a7d4e024f: + + .. table:: **Table 5** Common configurations of an HTTP source + + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+==========================================+=======================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. This parameter can be set only in the properties.properties file. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | http | Specifies the type, which is set to **http**. 
This parameter can be set only in the properties.properties file. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | ``-`` | Specifies the name or IP address of the bound host. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound port. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | handler | org.apache.flume.source.http.JSONHandler | Specifies the message parsing method of an HTTP request. The following methods are supported: | + | | | | + | | | - **org.apache.flume.source.http.JSONHandler**: JSON | + | | | - **org.apache.flume.sink.solr.morphline.BlobHandler**: BLOB | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | handler.\* | ``-`` | Specifies handler parameters. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enableSSL | false | Specifies whether SSL is enabled in HTTP. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the keystore path set after SSL is enabled in HTTP. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystorePassword | ``-`` | Specifies the keystore password set after SSL is enabled in HTTP. | + +-----------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Common Channel Configurations +----------------------------- + +- **Memory Channel** + + A memory channel uses memory as the cache. Events are stored in memory queues. :ref:`Table 6 ` lists common configurations. + + .. _mrs_01_0396__tc1421df5bc6c415ca490e671ea935f85: + + .. table:: **Table 6** Common configurations of a memory channel + + +---------------------+---------------+-------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=====================+===============+===================================================================================================================+ + | type | ``-`` | Specifies the type, which is set to **memory**. This parameter can be set only in the properties.properties file. 
| + +---------------------+---------------+-------------------------------------------------------------------------------------------------------------------+ + | capacity | 10000 | Specifies the maximum number of events cached in a channel. | + +---------------------+---------------+-------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | 1000 | Specifies the maximum number of events accessed each time. | + +---------------------+---------------+-------------------------------------------------------------------------------------------------------------------+ + | channelfullcount | 10 | Specifies the channel full count. When the count reaches the threshold, an alarm is reported. | + +---------------------+---------------+-------------------------------------------------------------------------------------------------------------------+ + +- **File Channel** + + A file channel uses local disks as the cache. Events are stored in the folder specified by **dataDirs**. :ref:`Table 7 ` lists common configurations. + + .. _mrs_01_0396__td180d6190e86420d8779010b90877938: + + .. table:: **Table 7** Common configurations of a file channel + + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +======================+=======================================+=================================================================================================================================================+ + | type | ``-`` | Specifies the type, which is set to **file**. This parameter can be set only in the properties.properties file. | + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointDir | ${BIGDATA_DATA_HOME}/flume/checkpoint | Specifies the checkpoint storage directory. | + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDirs | ${BIGDATA_DATA_HOME}/flume/data | Specifies the data cache directory. Multiple directories can be configured to improve performance. The directories are separated by commas (,). | + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | maxFileSize | 2146435071 | Specifies the maximum size of a single cache file, expressed in bytes. | + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | minimumRequiredSpace | 524288000 | Specifies the minimum idle space in the cache, expressed in bytes. | + +----------------------+---------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | capacity | 1000000 | Specifies the maximum number of events cached in a channel. 
| + +----------------------+---------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity  | 10000                                 | Specifies the maximum number of events accessed each time.                                                                                       | + +----------------------+---------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ + | channelfullcount     | 10                                    | Specifies the channel full count. When the count reaches the threshold, an alarm is reported.                                                    | + +----------------------+---------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Kafka Channel** + + A Kafka channel uses a Kafka cluster as the cache. Kafka provides high availability and multiple replicas, so the cached events remain available to the sinks even if Flume or a Kafka Broker crashes. :ref:`Table 8 ` lists common configurations. + + .. _mrs_01_0396__ta58e4ea5e98446418e498b81cf0c75b7: + + .. table:: **Table 8** Common configurations of a Kafka channel + + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | Parameter                        | Default Value         | Description                                                                                                       | + +==================================+=======================+===================================================================================================================+ + | type                             | ``-``                 | Specifies the type, which is set to **org.apache.flume.channel.kafka.KafkaChannel**.                             | + |                                  |                       |                                                                                                                   | + |                                  |                       | This parameter can be set only in the properties.properties file.                                                | + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers          | ``-``                 | Specifies the list of Brokers in the Kafka cluster.                                                              | + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | kafka.topic                      | flume-channel         | Specifies the Kafka topic used by the channel to cache data.                                                     | + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id          | flume                 | Specifies the Kafka consumer group ID.                                                                           | + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | parseAsFlumeEvent                | true                  | Specifies whether data is parsed into Flume events.                                                              | + +----------------------------------+-----------------------+-------------------------------------------------------------------------------------------------------------------+ + | migrateZookeeperOffsets          | true                  | Specifies whether to search for offsets in ZooKeeper and submit them to Kafka when there is no offset in Kafka. 
| + +----------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.auto.offset.reset | latest | Consumes data from the specified location when there is no offset. | + +----------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ + | kafka.producer.security.protocol | SASL_PLAINTEXT | Specifies the Kafka producer security protocol. | + +----------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.security.protocol | SASL_PLAINTEXT | Specifies the Kafka consumer security protocol. | + +----------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ + +Common Sink Configurations +-------------------------- + +- **HDFS Sink** + + An HDFS sink writes data into HDFS. :ref:`Table 9 ` lists common configurations. + + .. _mrs_01_0396__t3f4509459f734167afdd0cb20857d2ef: + + .. table:: **Table 9** Common configurations of an HDFS sink + + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==========================+=======================+=====================================================================================================================================================================================================================================================+ + | channel | **-** | Specifies the channel connected to the sink. This parameter can be set only in the properties.properties file. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | hdfs | Specifies the type, which is set to **hdfs**. This parameter can be set only in the properties.properties file. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the sink is restarted. Unit: second | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | ``-`` | Specifies the HDFS path. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.inUseSuffix | .tmp | Specifies the suffix of the HDFS file to which data is being written. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollInterval | 30 | Specifies the interval for file rolling, expressed in seconds. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollSize | 1024 | Specifies the size for file rolling, expressed in bytes. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollCount | 10 | Specifies the number of events for file rolling. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.idleTimeout | 0 | Specifies the timeout interval for closing idle files automatically, expressed in seconds. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | 1000 | Specifies the number of events written into HDFS at a time. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | ``-`` | Specifies the Kerberos username for HDFS authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | ``-`` | Specifies the Kerberos keytab of HDFS authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.fileCloseByEndEvent | true | Specifies whether to close the file when the last event is received. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchCallTimeout | ``-`` | Specifies the timeout control duration each time events are written into HDFS, expressed in milliseconds. | + | | | | + | | | If this parameter is not specified, the timeout duration is controlled when each event is written into HDFS. When the value of **hdfs.batchSize** is greater than 0, configure this parameter to improve the performance of writing data into HDFS. | + | | | | + | | | .. note:: | + | | | | + | | | The value of **hdfs.batchCallTimeout** depends on **hdfs.batchSize**. A greater **hdfs.batchSize** requires a larger **hdfs.batchCallTimeout**. If the value of **hdfs.batchCallTimeout** is too small, writing events to HDFS may fail. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | serializer.appendNewline | true | Specifies whether to add a line feed character (**\\n**) after an event is written to HDFS. If a line feed character is added, the data volume counters used by the line feed character will not be calculated by HDFS sinks. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Avro Sink** + + An Avro sink converts events into Avro events and sends them to the monitoring ports of the hosts. :ref:`Table 10 ` lists common configurations. + + .. _mrs_01_0396__tcf9863ee677d41a6882b71987541fa33: + + .. table:: **Table 10** Common configurations of an Avro sink + + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=====================+===============+=================================================================================================================+ + | channel | **-** | Specifies the channel connected to the sink. This parameter can be set only in the properties.properties file. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type, which is set to **avro**. This parameter can be set only in the properties.properties file. 
| + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | hostname | ``-`` | Specifies the name or IP address of the bound host. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the monitoring port. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | batch-size | 1000 | Specifies the number of events sent in a batch. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | truststore-type | JKS | Specifies the Java trust store type. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | truststore | ``-`` | Specifies the Java trust store file. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | truststore-password | ``-`` | Specifies the Java trust store password. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the key storage type. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the key storage file. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + | keystore-password | ``-`` | Specifies the key storage password. | + +---------------------+---------------+-----------------------------------------------------------------------------------------------------------------+ + +- **HBase Sink** + + An HBase sink writes data into HBase. :ref:`Table 11 ` lists common configurations. + + .. _mrs_01_0396__tf429beac69444e93a744abfe1d0fb744: + + .. table:: **Table 11** Common configurations of an HBase sink + + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===================+===============+======================================================================================================================================================+ + | channel | **-** | Specifies the channel connected to the sink. This parameter can be set only in the properties.properties file. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type, which is set to **hbase**. This parameter can be set only in the properties.properties file. 
| + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table | ``-`` | Specifies the HBase table name. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the sink is restarted. Unit: second | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | columnFamily | ``-`` | Specifies the HBase column family. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written into HBase at a time. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosPrincipal | ``-`` | Specifies the Kerberos username for HBase authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosKeytab | ``-`` | Specifies the Kerberos keytab of HBase authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled. | + +-------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Kafka Sink** + + A Kafka sink writes data into Kafka. :ref:`Table 12 ` lists common configurations. + + .. _mrs_01_0396__tf898876f2a2f45629655554005c3f0a8: + + .. table:: **Table 12** Common configurations of a Kafka sink + + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=================================+=======================+===================================================================================================================================================================================+ + | channel | **-** | Specifies the channel connected to the sink. This parameter can be set only in the properties.properties file. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type, which is set to **org.apache.flume.sink.kafka.KafkaSink**. | + | | | | + | | | This parameter can be set only in the properties.properties file. 
| + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | ``-`` | Specifies the list of Kafka Brokers, which are separated by commas. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the sink is restarted. Unit: second | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topic | default-flume-topic | Specifies the topic where data is written. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flumeBatchSize | 1000 | Specifies the number of events written into Kafka at a time. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.security.protocol | SASL_PLAINTEXT | Specifies the security protocol of Kafka. The value must be set to **PLAINTEXT** for clusters in which Kerberos authentication is disabled. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.kerberos.domain.name | ``-`` | Specifies the Kafka domain name. This parameter is mandatory for a security cluster. This parameter can be set only in the properties.properties file. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Other Kafka Producer Properties | ``-`` | Specifies other Kafka configurations. This parameter can be set to any production configuration supported by Kafka, and the **.kafka** prefix must be added to the configuration. | + | | | | + | | | This parameter can be set only in the properties.properties file. | + +---------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flume/flume_service_configuration_guide.rst b/doc/component-operation-guide/source/using_flume/flume_service_configuration_guide.rst new file mode 100644 index 0000000..074709d --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/flume_service_configuration_guide.rst @@ -0,0 +1,783 @@ +:original_name: mrs_01_1057.html + +.. 
_mrs_01_1057: + +Flume Service Configuration Guide +================================= + +This section applies to MRS 3.\ *x* or later clusters. + +This configuration guide describes how to configure common Flume services. For non-common Source, Channel, and Sink configuration, see the user manual provided by the Flume community. + +.. note:: + + - Parameters in bold in the following tables are mandatory. + - The value of **BatchSize** of the Sink must be less than that of **transactionCapacity** of the Channel. + - Only some parameters of Source, Channel, and Sink are displayed on the Flume configuration tool page. For details, see the following configurations. + - The Customer Source, Customer Channel, and Customer Sink displayed on the Flume configuration tool page need to be configured based on self-developed code. The following common configurations are not displayed. + +Common Source Configurations +---------------------------- + +- **Avro Source** + + An Avro source listens to the Avro port, receives data from the external Avro client, and places data into configured channels. Common configurations are as follows: + + .. table:: **Table 1** Common configurations of an Avro source + + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+=======================+=====================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | avro | Specifies the type of the avro source, which must be **avro**. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | ``-`` | Specifies the listening host name/IP address. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound listening port. Ensure that this port is not occupied. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | threads | ``-`` | Specifies the maximum number of source threads. 
| + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-type | none | Specifies the message compression format, which can be set to **none** or **deflate**. **none** indicates that data is not compressed, while **deflate** indicates that data is compressed. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-level | 6 | Specifies the data compression level, which ranges from **1** to **9**. The larger the value is, the higher the compression rate is. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. If this parameter is set to **true**, the values of **keystore** and **keystore-password** must be specified. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-type | JKS | Specifies the Java trust store type, which can be set to **JKS** or **PKCS12**. | + | | | | + | | | .. note:: | + | | | | + | | | Different passwords are used to protect the key store and private key of **JKS**, while the same password is used to protect the key store and private key of **PKCS12**. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore | ``-`` | Specifies the Java trust store file. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-password | ``-`` | Specifies the Java trust store password. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the keystore type set after SSL is enabled, which can be set to **JKS** or **PKCS12**. | + | | | | + | | | .. note:: | + | | | | + | | | Different passwords are used to protect the key store and private key of **JKS**, while the same password is used to protect the key store and private key of **PKCS12**. 
| + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the keystore file path set after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-password | ``-`` | Specifies the keystore password set after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | trust-all-certs | false | Specifies whether to disable the check for the SSL server certificate. If this parameter is set to **true**, the SSL server certificate of the remote source is not checked. You are not advised to perform this operation during the production. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | exclude-protocols | SSLv3 | Specifies the excluded protocols. The entered protocols must be separated by spaces. The default value is **SSLv3**. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ipFilter | false | Specifies whether to enable the IP address filtering. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ipFilter.rules | ``-`` | Specifies the rules of *N* network **ipFilters**. Host names or IP addresses must be separated by commas (,). If this parameter is set to **true**, there are two configuration rules: allow and forbidden. The configuration format is as follows: | + | | | | + | | | ipFilterRules=allow:ip:127.*, allow:name:localhost, deny:ip:\* | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **SpoolDir Source** + + SpoolDir Source monitors and transmits new files that have been added to directories in real-time mode. Common configurations are as follows: + + .. 
table:: **Table 2** Common configurations of a Spooling Directory source + + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +============================+=======================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | spooldir | Specifies the type of the spooling source, which must be set to **spooldir**. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spoolDir | ``-`` | Specifies the monitoring directory of the Spooldir source. A Flume running user must have the read, write, and execution permissions on the directory. 
| + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileSuffix | .COMPLETED | Specifies the suffix added after file transmission is complete. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deletePolicy | never | Specifies the source file deletion policy after file transmission is complete. The value can be either **never** or **immediate**. **never** indicates that the source file is not deleted after file transmission is complete, while **immediate** indicates that the source file is immediately deleted after file transmission is complete. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ignorePattern | ^$ | Specifies the regular expression of a file to be ignored. The default value is ^$, indicating that spaces are ignored. 
| + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | includePattern | ^.*$ | Specifies the regular expression that contains a file. This parameter can be used together with **ignorePattern**. If a file meets both **ignorePattern** and **includePattern**, the file is ignored. In addition, when a file starts with a period (.), the file will not be filtered. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | trackerDir | .flumespool | Specifies the metadata storage path during data transmission. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written to the channel in batches. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | decodeErrorPolicy | FAIL | Specifies the code error policy. | + | | | | + | | | .. note:: | + | | | | + | | | If a code error occurs in the file, set **decodeErrorPolicy** to **REPLACE** or **IGNORE**. 
Flume will skip the code error and continue to collect subsequent logs. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer | LINE | Specifies the file parser. The value can be either **LINE** or **BufferedLine**. | + | | | | + | | | - When the value is set to **LINE**, characters read from the file are transcoded one by one. | + | | | - When the value is set to **BufferedLine**, one line or multiple lines of characters read from the file are transcoded in batches, which delivers better performance. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer.maxLineLength | 2048 | Specifies the maximum length for resolution by line. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | deserializer.maxBatchLine | 1 | Specifies the maximum number of lines for resolution by line. If multiple lines are set, **maxLineLength** must be set to a corresponding multiplier. | + | | | | + | | | .. note:: | + | | | | + | | | When configuring the Interceptor, take the multi-line combination into consideration to avoid data loss. If the Interceptor cannot process combined lines, set this parameter to **1**. 
| + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | selector.type | replicating | Specifies the selector type. The value can be either **replicating** or **multiplexing**. **replicating** indicates that data is replicated and then transferred to each channel so that each channel receives the same data, while **multiplexing** indicates that a channel is selected based on the value of the header in the event and each channel has different data. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | interceptors | ``-`` | Specifies the interceptor. Multiple interceptors are separated by spaces. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | inputCharset | UTF-8 | Specifies the encoding format of a read file. The encoding format must be the same as that of the data source file that has been read. Otherwise, an error may occur during character parsing. 
| + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileHeader | false | Specifies whether to add the file name (including the file path) to the event header. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileHeaderKey | ``-`` | Specifies that the data storage structure in header is set in the mode. Parameters **fileHeaderKey** and **fileHeader** must be used together. Following is an example if **fileHeader** is set to true: | + | | | | + | | | Define **fileHeaderKey** as **file**. When the **/root/a.txt** file is read, **fileHeaderKey** exists in the header in the **file=/root/a.txt** format. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | basenameHeader | false | Specifies whether to add the file name (excluding the file path) to the event header. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | basenameHeaderKey | ``-`` | Specifies that the data storage structure in header is set in the mode. 
Parameters **basenameHeaderKey** and **basenameHeader** must be used together. Following is an example if **basenameHeader** is set to **true**: | + | | | | + | | | Define **basenameHeaderKey** as **file**. When the **a.txt** file is read, **basenameHeaderKey** exists in the header in the **file=a.txt** format. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | pollDelay | 500 | Specifies the delay for polling new files in the monitoring directory. Unit: milliseconds | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | recursiveDirectorySearch | false | Specifies whether to monitor new files in the subdirectories of the configured directory. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | consumeOrder | oldest | Specifies the consumption order of files in a directory. If this parameter is set to **oldest** or **youngest**, the sequence of files to be read is determined by the last modification time of files in the monitored directory. If there are a large number of files in the directory, it takes a long time to search for the **oldest** or **youngest** file. If this parameter is set to **random**, an earlier created file may not be read for a long time. The options are as follows: **random**, **youngest**, and **oldest**.
| + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | maxBackoff | 4000 | Specifies the maximum time to wait between consecutive attempts to write to a channel if the channel is full. If this time is exceeded, an exception is thrown. The source starts with a short wait and increases it exponentially on each consecutive failed attempt until the specified maximum is reached. If data still cannot be written, the write fails. Unit: millisecond | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | emptyFileEvent | true | Specifies whether to collect empty file information and send it to the sink end. The default value is **true**, indicating that empty file information is sent to the sink end. This parameter is valid only for HDFS Sink. Taking HDFS Sink as an example, if this parameter is set to **true** and an empty file exists in the **spoolDir** directory, an empty file with the same name will be created in the **hdfs.path** directory of HDFS. | + +----------------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + When data is read by row, SpoolDir Source ignores the line feed character at the end of each event. Therefore, this line feed character is not included in the data volume counters of Flume. + +- **Kafka Source** + + A Kafka source consumes data from Kafka topics. Multiple sources can consume data of the same topic, and the sources consume different partitions of the topic. Common configurations are as follows: + + ..
table:: **Table 3** Common configurations of a Kafka source + + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=================================+===========================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | org.apache.flume.source.kafka.KafkaSource | Specifies the type of the Kafka source, which must be set to **org.apache.flume.source.kafka.KafkaSource**. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | ``-`` | Specifies the bootstrap address port list of Kafka. If Kafka has been installed in the cluster and the configuration has been synchronized to the server, you do not need to set this parameter on the server. The default value is the list of all brokers in the Kafka cluster. This parameter must be configured on the client. Use commas (,) to separate multiple values of *IP address:Port number*. 
The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics | ``-`` | Specifies the list of subscribed Kafka topics, which are separated by commas (,). | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics.regex | ``-`` | Specifies the subscribed topics that comply with regular expressions. **kafka.topics.regex** has a higher priority than **kafka.topics** and will overwrite **kafka.topics**. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | nodatatime | 0 (Disabled) | Specifies the alarm threshold. An alarm is triggered when the duration that Kafka does not release data to subscribers exceeds the threshold. Unit: second This parameter can be configured in the **properties.properties** file. 
| + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written to the channel in batches. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchDurationMillis | 1000 | Specifies the maximum duration of topic data consumption at a time, expressed in milliseconds. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keepTopicInHeader | false | Specifies whether to save topics in the event header. If the parameter value is **true**, topics configured in Kafka Sink become invalid. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | setTopicHeader | true | If this parameter is set to **true**, the topic name defined in **topicHeader** is stored in the header. 
| + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | topicHeader | topic | When **setTopicHeader** is set to **true**, this parameter specifies the name of the topic received by the storage device. If the property is used with that of Kafka Sink **topicHeader**, be careful not to send messages to the same topic cyclically. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | useFlumeEventFormat | false | By default, an event is transferred from a Kafka topic to the body of the event in the form of bytes. If this parameter is set to **true**, the Avro binary format of Flume is used to read events. When used together with the **parseAsFlumeEvent** parameter with the same name in KafkaSink or KakfaChannel, any set **header** generated from the data source is retained. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keepPartitionInHeader | false | Specifies whether to save partition IDs in the event header. If the parameter value is **true**, Kafka Sink writes data to the corresponding partition. 
| + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id | flume | Specifies the Kafka consumer group ID. Sources or proxies having the same ID are in the same consumer group. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.security.protocol | SASL_PLAINTEXT | Specifies the Kafka security protocol. The parameter value must be set to PLAINTEXT in a common cluster. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Other Kafka Consumer Properties | ``-`` | Specifies other Kafka configurations. This parameter can be set to any consumption configuration supported by Kafka, and the **.kafka** prefix must be added to the configuration. | + +---------------------------------+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Taildir Source** + + A Taildir source monitors file changes in a directory and automatically reads the file content. In addition, it can transmit data in real time. Common configurations are as follows: + + .. 
table:: **Table 4** Common configurations of a Taildir source + + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +========================================+=======================+==========================================================================================================================================================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | TAILDIR | Specifies the type of the taildir source, which must be set to TAILDIR. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups | ``-`` | Specifies the group name of a collection file directory. Group names are separated by spaces. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups..parentDir | ``-`` | Specifies the parent directory. The value must be an absolute path. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups..filePattern | ``-`` | Specifies the relative file path of the file group's parent directory. Directories can be included and regular expressions are supported. It must be used together with **parentDir**. 
| + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | positionFile | ``-`` | Specifies the metadata storage path during data transmission. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | headers.. | ``-`` | Specifies the key-value of an event when data of a group is being collected. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | byteOffsetHeader | false | Specifies whether each event header contains the event location information in the source file. If the parameter value is true, the location information is saved in the byteoffset variable. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | maxBatchCount | Long.MAX_VALUE | Specifies the maximum number of batches that can be consecutively read from a file. If the monitored directory reads multiple files consecutively and one of the files is written at a rapid rate, other files may fail to be processed. This is because the file that is written at a high speed will be in an infinite read loop. In this case, set this parameter to a smaller value. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | skipToEnd | false | Specifies whether Flume can locate the latest location of a file and read the latest data after restart. If the parameter value is true, Flume locates and reads the latest file data after restart. 
| + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | idleTimeout | 120000 | Specifies the idle duration during file reading, expressed in milliseconds. If file content is not changed in the preset time duration, close the file. If data is written to this file after the file is closed, open the file and read data. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | writePosInterval | 3000 | Specifies the interval for writing metadata to a file, expressed in milliseconds. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written to the channel in batches. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the source is restarted. Unit: second | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileHeader | false | Specifies whether to add the file name (including the file path) to the event header. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | fileHeaderKey | file | Specifies that the data storage structure in header is set in the mode. 
Parameters **fileHeaderKey** and **fileHeader** must be used together. Following is an example if **fileHeader** is set to true: | + | | | | + | | | Define **fileHeaderKey** as **file**. When the **/root/a.txt** file is read, **fileHeaderKey** exists in the header in the **file=/root/a.txt** format. | + +----------------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Http Source** + + An HTTP source receives data from an external HTTP client and sends the data to the configured channels. Common configurations are as follows: + + .. table:: **Table 5** Common configurations of an HTTP source + + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+==========================================+==================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | http | Specifies the type of the http source, which must be set to http. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | ``-`` | Specifies the listening host name/IP address. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound listening port. Ensure that this port is not occupied. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | handler | org.apache.flume.source.http.JSONHandler | Specifies the message parsing method of an HTTP request. Two formats are supported: JSON (org.apache.flume.source.http.JSONHandler) and BLOB (org.apache.flume.sink.solr.morphline.BlobHandler). 
| + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | handler.\* | ``-`` | Specifies handler parameters. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | exclude-protocols | SSLv3 | Specifies the excluded protocols. The entered protocols must be separated by spaces. The default value is **SSLv3**. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | include-cipher-suites | ``-`` | Specifies the included protocols. The entered protocols must be separated by spaces. If this parameter is left empty, all protocols are supported by default. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enableSSL | false | Specifies whether SSL is enabled in HTTP. If this parameter is set to **true**, the values of **keystore** and **keystore-password** must be specified. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the keystore type, which can be **JKS** or **PKCS12**. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the keystore path set after SSL is enabled in HTTP. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystorePassword | ``-`` | Specifies the keystore password set after SSL is enabled in HTTP. | + +-----------------------+------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Thrift Source** + + Thrift Source monitors the thrift port, receives data from the external Thrift clients, and puts the data into the configured channel. 
Common configurations are as follows: + + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+=======================+=========================================================================================================================================================================================================================================================================================================================================+ + | channels | ``-`` | Specifies the channel connected to the source. Multiple channels can be configured. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | thrift | Specifies the type of the thrift source, which must be set to **thrift**. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | ``-`` | Specifies the listening host name/IP address. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound listening port. Ensure that this port is not occupied. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | threads | ``-`` | Specifies the maximum number of worker threads that can be run. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberos | false | Specifies whether Kerberos authentication is enabled. 
| + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | agent-keytab | ``-`` | Specifies the address of the keytab file used by the server. The machine-machine account must be used. You are advised to use **flume/conf/flume_server.keytab** in the Flume service installation directory. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | agent-principal | ``-`` | Specifies the principal of the security user used by the server. The principal must be a machine-machine account. You are advised to use the default user of Flume: flume_server/hadoop.\ **\ @\ ** | + | | | | + | | | .. note:: | + | | | | + | | | **flume_server/hadoop.**\ <*system domain name*> is the username. All letters in the system domain name contained in the username are lowercase letters. For example, **Local Domain** is set to **9427068F-6EFA-4833-B43E-60CB641E5B6C.COM**, and the username is **flume_server/hadoop.9427068f-6efa-4833-b43e-60cb641e5b6c.com**. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-type | none | Specifies the message compression format, which can be set to **none** or **deflate**. **none** indicates that data is not compressed, while **deflate** indicates that data is compressed. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. If this parameter is set to **true**, the values of **keystore** and **keystore-password** must be specified. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the keystore type set after SSL is enabled. 
| + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the keystore file path set after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-password | ``-`` | Specifies the keystore password set after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +-----------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Common Channel Configurations +----------------------------- + +- **Memory Channel** + + A memory channel uses memory as the cache. Events are stored in memory queues. Common configurations are as follows: + + .. table:: **Table 6** Common configurations of a memory channel + + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==============================+===============================+=========================================================================================================================================================+ + | type | ``-`` | Specifies the type of the memory channel, which must be set to **memory**. | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | capacity | 10000 | Specifies the maximum number of events cached in a channel. | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | 1000 | Specifies the maximum number of events accessed each time. | + | | | | + | | | .. note:: | + | | | | + | | | - The parameter value must be greater than the batchSize of the source and sink. | + | | | - The value of **transactionCapacity** must be less than or equal to that of **capacity**. | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | channelfullcount | 10 | Specifies the channel full count. When the count reaches the threshold, an alarm is reported. 
| + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keep-alive | 3 | Specifies the waiting time of the Put and Take threads when the transaction or channel cache is full. Unit: second | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | byteCapacity | 80% of the maximum JVM memory | Specifies the total bytes of all event bodies in a channel. The default value is the 80% of the maximum JVM memory (indicated by **-Xmx**). Unit: bytes | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | byteCapacityBufferPercentage | 20 | Specifies the percentage of bytes in a channel (%). | + +------------------------------+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **File Channel** + + A file channel uses local disks as the cache. Events are stored in the folder specified by **dataDirs**. Common configurations are as follows: + + .. table:: **Table 7** Common configurations of a file channel + + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+======================================================+=================================================================================================================================================+ + | type | ``-`` | Specifies the type of the file channel, which must be set to **file**. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointDir | ${BIGDATA_DATA_HOME}/hadoop/data1~N/flume/checkpoint | Specifies the checkpoint storage directory. | + | | | | + | | .. note:: | | + | | | | + | | This path is changed with the custom data path. | | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDirs | ${BIGDATA_DATA_HOME}/hadoop/data1~N/flume/data | Specifies the data cache directory. Multiple directories can be configured to improve performance. The directories are separated by commas (,). | + | | | | + | | .. note:: | | + | | | | + | | This path is changed with the custom data path. | | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | maxFileSize | 2146435071 | Specifies the maximum size of a single cache file, expressed in bytes. 
| + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | minimumRequiredSpace | 524288000 | Specifies the minimum idle space in the cache, expressed in bytes. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | capacity | 1000000 | Specifies the maximum number of events cached in a channel. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | 10000 | Specifies the maximum number of events accessed each time. | + | | | | + | | | .. note:: | + | | | | + | | | - The parameter value must be greater than the batchSize of the source and sink. | + | | | - The value of **transactionCapacity** must be less than or equal to that of **capacity**. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | channelfullcount | 10 | Specifies the channel full count. When the count reaches the threshold, an alarm is reported. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | useDualCheckpoints | false | Specifies the backup checkpoint. If this parameter is set to **true**, the **backupCheckpointDir** parameter value must be set. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | backupCheckpointDir | ``-`` | Specifies the path of the backup checkpoint. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointInterval | 30000 | Specifies the check interval, expressed in seconds. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | keep-alive | 3 | Specifies the waiting time of the Put and Take threads when the transaction or channel cache is full. Unit: second | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | use-log-replay-v1 | false | Specifies whether to enable the old reply logic. 
| + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | use-fast-replay | false | Specifies whether to enable fast replay, which does not use the queue. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointOnClose | true | Specifies whether a checkpoint is created when the channel is closed. | + +-----------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Memory File Channel** + + A memory file channel uses both memory and local disks as its cache and supports message persistence. It provides performance similar to that of a memory channel and better than that of a file channel. This channel is currently experimental and not recommended for use in production. Common configurations are as follows: + + .. table:: **Table 8** Common configurations of a memory file channel + + +-----------------------+--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=======================+============================================+=============================================================================================================================================================================================================================================================================================================================================================================+ + | type | org.apache.flume.channel.MemoryFileChannel | Specifies the type of the memory file channel, which must be set to **org.apache.flume.channel.MemoryFileChannel**. | + +-----------------------+--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | capacity | 50000 | Specifies the maximum number of events cached in a channel.
| + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | 5000 | Specifies the maximum number of events processed by a transaction. | + | | | | + | | | .. note:: | + | | | | + | | | - The parameter value must be greater than the batchSize of the source and sink. | + | | | - The value of **transactionCapacity** must be less than or equal to that of **capacity**. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | subqueueByteCapacity | 20971520 | Specifies the maximum size of events that can be stored in a subqueue, expressed in bytes. | + | | | | + | | | A memory file channel uses both queues and subqueues to cache data. Events are stored in a subqueue, and subqueues are stored in a queue. | + | | | | + | | | **subqueueByteCapacity** and **subqueueInterval** determine the size of events that can be stored in a subqueue. **subqueueByteCapacity** specifies the byte capacity of a subqueue, and **subqueueInterval** specifies the maximum duration that a subqueue can store events. Events in a subqueue are sent to the destination only after the subqueue reaches the **subqueueByteCapacity** or **subqueueInterval** limit. | + | | | | + | | | .. note:: | + | | | | + | | | The value of **subqueueByteCapacity** must be greater than the total size (in bytes) of the events in a batch specified by **batchSize**. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | subqueueInterval | 2000 | Specifies the maximum duration that a subqueue can store events, expressed in milliseconds. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keep-alive | 3 | Specifies the waiting time of the Put and Take threads when the transaction or channel cache is full. 
| + | | | | + | | | Unit: second | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDir | ``-`` | Specifies the cache directory for local files. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | byteCapacity | 80% of the maximum JVM memory | Specifies the channel cache capacity. | + | | | | + | | | Unit: bytes | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-type | None | Specifies the message compression format, which can be set to **none** or **deflate**. **none** indicates that data is not compressed, while **deflate** indicates that data is compressed. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | channelfullcount | 10 | Specifies the channel full count. When the count reaches the threshold, an alarm is reported. | + +-----------------------+--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + The following is a configuration example of a memory file channel: + + .. code-block:: + + server.channels.c1.type = org.apache.flume.channel.MemoryFileChannel + server.channels.c1.dataDir = /opt/flume/mfdata + server.channels.c1.subqueueByteCapacity = 20971520 + server.channels.c1.subqueueInterval = 2000 + server.channels.c1.capacity = 500000 + server.channels.c1.transactionCapacity = 40000 + +- **Kafka Channel** + + A Kafka channel uses a Kafka cluster as the cache. Kafka provides high availability and data replication, so the cached data is not lost if Flume or a Kafka broker crashes before the data is consumed by sinks. 
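+
+   The following is a minimal configuration sketch of a Kafka channel, shown here for orientation; the parameters are described in Table 9. The channel name **kc1** and the broker address **192.168.20.30:21007** are placeholder assumptions and must be replaced with the values of your cluster.
+
+   .. code-block::
+
+      server.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
+      # Use port 21007 with SASL_PLAINTEXT in security mode, or port 9092 with PLAINTEXT in common mode.
+      server.channels.kc1.kafka.bootstrap.servers = 192.168.20.30:21007
+      server.channels.kc1.kafka.topic = flume-channel
+      server.channels.kc1.kafka.consumer.group.id = flume
+      server.channels.kc1.parseAsFlumeEvent = true
+      server.channels.kc1.kafka.producer.security.protocol = SASL_PLAINTEXT
+      server.channels.kc1.kafka.consumer.security.protocol = SASL_PLAINTEXT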
+ + .. table:: **Table 9** Common configurations of a Kafka channel + + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==================================+=======================+==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | type | ``-`` | Specifies the type of the Kafka channel, which must be set to **org.apache.flume.channel.kafka.KafkaChannel**. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | ``-`` | Specifies the bootstrap address port list of Kafka. | + | | | | + | | | If Kafka has been installed in the cluster and the configuration has been synchronized to the server, you do not need to set this parameter on the server. The default value is the list of all brokers in the Kafka cluster. This parameter must be configured on the client. Use commas (,) to separate multiple values of *IP address:Port number*. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topic | flume-channel | Specifies the Kafka topic used by the channel to cache data. 
| + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id | flume | Specifies the data group ID obtained from Kafka. This parameter cannot be left blank. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | parseAsFlumeEvent | true | Specifies whether data is parsed into Flume events. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | migrateZookeeperOffsets | true | Specifies whether to search for offsets in ZooKeeper and submit them to Kafka when there is no offset in Kafka. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.auto.offset.reset | latest | Specifies where to consume if there is no offset record, which can be set to **earliest**, **latest**, or **none**. **earliest** indicates that the offset is reset to the initial point, **latest** indicates that the offset is set to the latest position, and **none** indicates that an exception is thrown if there is no offset. 
| + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.producer.security.protocol | SASL_PLAINTEXT | Specifies the Kafka producer security protocol. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + | | | | + | | | .. note:: | + | | | | + | | | If the parameter is not displayed, click **+** in the lower left corner of the dialog box to display all parameters. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.security.protocol | SASL_PLAINTEXT | Specifies the Kafka consumer security protocol. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | pollTimeout | 500 | Specifies the maximum timeout interval for the consumer to invoke the poll function. Unit: milliseconds | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ignoreLongMessage | false | Specifies whether to discard oversized messages. 
| + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | messageMaxLength | 1000012 | Specifies the maximum length of a message written by Flume to Kafka. | + +----------------------------------+-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Common Sink Configurations +-------------------------- + +- **HDFS Sink** + + An HDFS sink writes data into HDFS. Common configurations are as follows: + + .. table:: **Table 10** Common configurations of an HDFS sink + + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +==========================+=======================+=====================================================================================================================================================================================================================================================================================================================================================================+ + | channel | ``-`` | Specifies the channel connected to the sink. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | hdfs | Specifies the type of the hdfs sink, which must be set to **hdfs**. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | ``-`` | Specifies the data storage path in HDFS. The value must start with **hdfs://hacluster/**. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the sink is restarted. Unit: second | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.inUseSuffix | .tmp | Specifies the suffix of the HDFS file to which data is being written. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollInterval | 30 | Specifies the interval for file rolling, expressed in seconds. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollSize | 1024 | Specifies the size for file rolling, expressed in bytes. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollCount | 10 | Specifies the number of events for file rolling. | + | | | | + | | | .. note:: | + | | | | + | | | Parameters **rollInterval**, **rollSize**, and **rollCount** can be configured at the same time. The file is rolled as soon as any one of these conditions is met. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.idleTimeout | 0 | Specifies the timeout interval for closing idle files automatically, expressed in seconds. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | 1000 | Specifies the number of events written into HDFS in batches. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | ``-`` | Specifies the Kerberos principal of HDFS authentication. This parameter is mandatory in a secure mode, but not required in a common mode. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | ``-`` | Specifies the Kerberos keytab of HDFS authentication. This parameter is not required in a common mode, but in a secure mode, the Flume running user must have the permission to access the **keyTab** path in the **jaas.conf** file. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.fileCloseByEndEvent | true | Specifies whether to close the HDFS file when the last event of the source file is received. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchCallTimeout | ``-`` | Specifies the timeout control duration when events are written into HDFS in batches. Unit: milliseconds | + | | | | + | | | If this parameter is not specified, the timeout duration is controlled when each event is written into HDFS. When the value of **hdfs.batchSize** is greater than 0, configure this parameter to improve the performance of writing data into HDFS. | + | | | | + | | | .. note:: | + | | | | + | | | The value of **hdfs.batchCallTimeout** depends on **hdfs.batchSize**. A greater **hdfs.batchSize** requires a larger **hdfs.batchCallTimeout**. If the value of **hdfs.batchCallTimeout** is too small, writing events to HDFS may fail. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | serializer.appendNewline | true | Specifies whether to add a line feed character (**\\n**) after an event is written to HDFS. If a line feed character is added, the bytes occupied by the line feed character are not included in the data volume counted by the HDFS sink. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.filePrefix | over_%{basename} | Specifies the file name prefix after data is written to HDFS. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.fileSuffix | ``-`` | Specifies the file name suffix after data is written to HDFS. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.inUsePrefix | ``-`` | Specifies the prefix of the HDFS file to which data is being written. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.fileType | DataStream | Specifies the HDFS file format, which can be set to **SequenceFile**, **DataStream**, or **CompressedStream**. | + | | | | + | | | .. note:: | + | | | | + | | | If the parameter is set to **SequenceFile** or **DataStream**, output files are not compressed, and the **codeC** parameter cannot be configured. However, if the parameter is set to **CompressedStream**, the output files are compressed, and the **codeC** parameter must also be configured. 
| + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.codeC | ``-`` | Specifies the file compression format, which can be set to **gzip**, **bzip2**, **lzo**, **lzop**, or **snappy**. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.maxOpenFiles | 5000 | Specifies the maximum number of HDFS files that can be opened. If the number of opened files reaches this value, the earliest opened files are closed. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.writeFormat | Writable | Specifies the file write format, which can be set to **Writable** or **Text**. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.callTimeout | 10000 | Specifies the timeout control duration each time events are written into HDFS, expressed in milliseconds. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.threadsPoolSize | ``-`` | Specifies the number of threads used by each HDFS sink for HDFS I/O operations. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.rollTimerPoolSize | ``-`` | Specifies the number of threads used by each HDFS sink to schedule the scheduled file rolling. 
| + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.round | false | Specifies whether to round off the timestamp value. If this parameter is set to true, all time-based escape sequences (except %t) are affected. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.roundUnit | second | Specifies the unit of the timestamp value that has been rounded off, which can be set to **second**, **minute**, or **hour**. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | true | Specifies whether to enable the local timestamp. The recommended parameter value is **true**. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.closeTries | 0 | Specifies the maximum number of attempts made by the **hdfs sink** to rename a file. If the parameter is set to the default value **0**, the sink keeps retrying until the file is successfully renamed. | + +--------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.retryInterval | 180 | Specifies the interval between attempts to close the HDFS file, expressed in seconds. | + | | | | + | | | .. note:: | + | | | | + | | | Each closing request involves multiple round-trip RPCs to the NameNode, so a value that is too small may overload the NameNode. If the parameter is set to **0**, the sink does not retry after the first closing attempt fails, and the file may remain open or keep the **.tmp** file name extension. 
| + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.failcount | 10 | Specifies the number of times that data fails to be written to HDFS. If the number of times that the sink fails to write data to HDFS exceeds the parameter value, an alarm indicating abnormal data transmission is reported. | + +--------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Avro Sink** + + An Avro sink converts events into Avro events and sends them to the monitoring ports of the hosts. Common configurations are as follows: + + .. table:: **Table 11** Common configurations of an Avro sink + + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===========================+=======================+=======================================================================================================================================================================================================================================================================================+ + | channel | ``-`` | Specifies the channel connected to the sink. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type of the avro sink, which must be set to **avro**. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hostname | ``-`` | Specifies the bound host name or IP address. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound listening port. Ensure that this port is not occupied. 
| + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batch-size | 1000 | Specifies the number of events sent in a batch. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | client.type | DEFAULT | Specifies the client instance type. Set this parameter based on the communication protocol used by the configured model. The options are as follows: | + | | | | + | | | - **DEFAULT**: The client instance of the AvroRPC type is returned. | + | | | - **OTHER**: NULL is returned. | + | | | - **THRIFT**: The client instance of the Thrift RPC type is returned. | + | | | - **DEFAULT_LOADBALANCING**: The client instance of the LoadBalancing RPC type is returned. | + | | | - **DEFAULT_FAILOVER**: The client instance of the Failover RPC type is returned. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. If this parameter is set to **true**, the values of **keystore** and **keystore-password** must be specified. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-type | JKS | Specifies the Java trust store type, which can be set to **JKS** or **PKCS12**. | + | | | | + | | | .. note:: | + | | | | + | | | Different passwords are used to protect the key store and private key of **JKS**, while the same password is used to protect the key store and private key of **PKCS12**. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore | ``-`` | Specifies the Java trust store file. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-password | ``-`` | Specifies the Java trust store password. 
| + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-type | JKS | Specifies the keystore type set after SSL is enabled. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore | ``-`` | Specifies the keystore file path set after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | keystore-password | ``-`` | Specifies the keystore password after SSL is enabled. This parameter is mandatory if SSL is enabled. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | connect-timeout | 20000 | Specifies the timeout for the first connection, expressed in milliseconds. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | request-timeout | 20000 | Specifies the maximum timeout for a request after the first request, expressed in milliseconds. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | reset-connection-interval | 0 | Specifies the interval between a connection failure and a second connection, expressed in seconds. If the parameter is set to **0**, the system continuously attempts to perform a connection. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-type | none | Specifies the compression type of the batch data, which can be set to **none** or **deflate**. **none** indicates that data is not compressed, while **deflate** indicates that data is compressed. This parameter value must be the same as that of the AvroSource compression-type. 
| + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-level | 6 | Specifies the compression level of batch data, which can be set to **1** to **9**. A larger value indicates a higher compression rate. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | exclude-protocols | SSLv3 | Specifies the excluded protocols. The entered protocols must be separated by spaces. The default value is **SSLv3**. | + +---------------------------+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **HBase Sink** + + An HBase sink writes data into HBase. Common configurations are as follows: + + .. table:: **Table 12** Common configurations of an HBase sink + + +--------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +====================+===============+===================================================================================================================================================================================================================================+ + | channel | ``-`` | Specifies the channel connected to the sink. | + +--------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type of the HBase sink, which must be set to **hbase**. | + +--------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table | ``-`` | Specifies the HBase table name. | + +--------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | columnFamily | ``-`` | Specifies the HBase column family. | + +--------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. 
When the update time exceeds the threshold, the sink is restarted. Unit: second | + +--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | 1000 | Specifies the number of events written into HBase in batches. | + +--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosPrincipal | ``-`` | Specifies the Kerberos principal of HBase authentication. This parameter is mandatory in a secure mode, but not required in a common mode. | + +--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosKeytab | ``-`` | Specifies the Kerberos keytab of HBase authentication. This parameter is not required in a common mode, but in a secure mode, the Flume running user must have the permission to access the **keyTab** path in the **jaas.conf** file. | + +--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | coalesceIncrements | true | Specifies whether to coalesce multiple increment operations on the same HBase cell within the same processing batch. Setting this parameter to **true** improves performance. | + +--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Kafka Sink** + + A Kafka sink writes data into Kafka. Common configurations are as follows: + + ..
table:: **Table 13** Common configurations of a Kafka sink + + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +=================================+================+=============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | channel | ``-`` | Specifies the channel connected to the sink. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | ``-`` | Specifies the type of the kafka sink, which must be set to **org.apache.flume.sink.kafka.KafkaSink**. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | ``-`` | Specifies the bootstrap address port list of Kafka. If Kafka has been installed in the cluster and the configuration has been synchronized to the server, you do not need to set this parameter on the server. The default value is the list of all brokers in the Kafka cluster. The client must be configured with this parameter. If there are multiple values, use commas (,) to separate the values. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). 
| + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | monTime | 0 (Disabled) | Specifies the thread monitoring threshold. When the update time exceeds the threshold, the sink is restarted. Unit: second | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.producer.acks | 1 | Specifies how a successful write is determined, based on the number of replica acknowledgements received. The value **0** indicates that no acknowledgement needs to be received, the value **1** indicates that the system waits only for the acknowledgement from the leader, and the value **-1** indicates that the system waits for the acknowledgements of all replicas. If this parameter is set to **-1**, data loss can be avoided in some leader failure scenarios. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topic | ``-`` | Specifies the topic to which data is written. This parameter is mandatory. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | flumeBatchSize | 1000 | Specifies the number of events written into Kafka in batches. 
| + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.security.protocol | SASL_PLAINTEXT | Specifies the Kafka security protocol. The parameter value must be set to PLAINTEXT in a common cluster. The rules for matching ports and security protocols must be as follows: port 21007 matches the security mode (SASL_PLAINTEXT), and port 9092 matches the common mode (PLAINTEXT). | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ignoreLongMessage | false | Specifies whether to discard oversized messages. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | messageMaxLength | 1000012 | Specifies the maximum length of a message written by Flume to Kafka. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | defaultPartitionId | ``-`` | Specifies the Kafka partition ID to which the events of a channel is transferred. The **partitionIdHeader** value overwrites this parameter value. By default, if this parameter is left blank, events will be distributed by the Kafka Producer's partitioner (by a specified key or a partitioner customized by **kafka.partitioner.class**). 
| + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | partitionIdHeader | ``-`` | When you set this parameter, the sink will take the value of the field named using the value of this property from the event header and send the message to the specified partition of the topic. If the value does not have a valid partition, **EventDeliveryException** is thrown. If the header value already exists, this setting overwrites the **defaultPartitionId** parameter. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Other Kafka Producer Properties | ``-`` | Specifies other Kafka configurations. This parameter can be set to any production configuration supported by Kafka, and the **.kafka** prefix must be added to the configuration. | + +---------------------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Thrift Sink** + + A Thrift sink converts events to Thrift events and sends them to the monitoring port of the configured host. Common configurations are as follows: + + .. table:: **Table 14** Common configurations of a Thrift sink + + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Description | + +===========================+===============+=========================================================================================================================================================================================================+ + | channel | ``-`` | Specifies the channel connected to the sink. 
| + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | type | thrift | Specifies the type of the thrift sink, which must be set to **thrift**. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hostname | ``-`` | Specifies the bound host name or IP address. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | ``-`` | Specifies the bound listening port. Ensure that this port is not occupied. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batch-size | 1000 | Specifies the number of events sent in a batch. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | connect-timeout | 20000 | Specifies the timeout for the first connection, expressed in milliseconds. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | request-timeout | 20000 | Specifies the maximum timeout for a request after the first request, expressed in milliseconds. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberos | false | Specifies whether Kerberos authentication is enabled. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | client-keytab | ``-`` | Specifies the path of the client **keytab** file. The Flume running user must have the access permission on the authentication file. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | client-principal | ``-`` | Specifies the principal of the security user used by the client. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | server-principal | ``-`` | Specifies the principal of the security user used by the server. 
| + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | compression-type | none | Specifies the compression type of data sent by Flume, which can be set to **none** or **deflate**. **none** indicates that data is not compressed, while **deflate** indicates that data is compressed. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | maxConnections | 5 | Specifies the maximum size of the connection pool for Flume to send data. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | false | Specifies whether to use SSL encryption. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-type | JKS | Specifies the Java trust store type. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore | ``-`` | Specifies the Java trust store file. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | truststore-password | ``-`` | Specifies the Java trust store password. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | reset-connection-interval | 0 | Specifies the interval between a connection failure and a second connection, expressed in seconds. If the parameter is set to **0**, the system continuously attempts to perform a connection. | + +---------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Precautions +----------- + +- What are the reliability measures of Flume? + + - Use the transaction mechanisms between Source and Channel as well as between Channel and Sink. + + - Configure the failover and load_balance mechanisms for Sink Processor. The following shows a load balancing example. + + .. code-block:: + + server.sinkgroups=g1 + server.sinkgroups.g1.sinks=k1 k2 + server.sinkgroups.g1.processor.type=load_balance + server.sinkgroups.g1.processor.backoff=true + server.sinkgroups.g1.processor.selector=random + +- What are the precautions for the aggregation and cascading of multiple Flume agents? + + - Avro or Thrift protocol can be used for cascading. 
+ - When the aggregation end contains multiple nodes, evenly distribute the agents and do not aggregate all agents on a single node. diff --git a/doc/component-operation-guide/source/using_flume/index.rst b/doc/component-operation-guide/source/using_flume/index.rst new file mode 100644 index 0000000..afa40c1 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/index.rst @@ -0,0 +1,50 @@ +:original_name: mrs_01_0390.html + +.. _mrs_01_0390: + +Using Flume +=========== + +- :ref:`Using Flume from Scratch ` +- :ref:`Overview ` +- :ref:`Installing the Flume Client ` +- :ref:`Viewing Flume Client Logs ` +- :ref:`Stopping or Uninstalling the Flume Client ` +- :ref:`Using the Encryption Tool of the Flume Client ` +- :ref:`Flume Service Configuration Guide ` +- :ref:`Flume Configuration Parameter Description ` +- :ref:`Using Environment Variables in the properties.properties File ` +- :ref:`Non-Encrypted Transmission ` +- :ref:`Encrypted Transmission ` +- :ref:`Viewing Flume Client Monitoring Information ` +- :ref:`Connecting Flume to Kafka in Security Mode ` +- :ref:`Connecting Flume with Hive in Security Mode ` +- :ref:`Configuring the Flume Service Model ` +- :ref:`Introduction to Flume Logs ` +- :ref:`Flume Client Cgroup Usage Guide ` +- :ref:`Secondary Development Guide for Flume Third-Party Plug-ins ` +- :ref:`Common Issues About Flume ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_flume_from_scratch + overview + installing_the_flume_client/index + viewing_flume_client_logs + stopping_or_uninstalling_the_flume_client + using_the_encryption_tool_of_the_flume_client + flume_service_configuration_guide + flume_configuration_parameter_description + using_environment_variables_in_the_properties.properties_file + non-encrypted_transmission/index + encrypted_transmission/index + viewing_flume_client_monitoring_information + connecting_flume_to_kafka_in_security_mode + connecting_flume_with_hive_in_security_mode + configuring_the_flume_service_model/index + introduction_to_flume_logs + flume_client_cgroup_usage_guide + secondary_development_guide_for_flume_third-party_plug-ins + common_issues_about_flume diff --git a/doc/component-operation-guide/source/using_flume/installing_the_flume_client/index.rst b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/index.rst new file mode 100644 index 0000000..0ed878c --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/index.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_0392.html + +.. _mrs_01_0392: + +Installing the Flume Client +=========================== + +- :ref:`Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x ` +- :ref:`Installing the Flume Client on MRS 3.x or Later Clusters ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x + installing_the_flume_client_on_mrs_3.x_or_later_clusters diff --git a/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst new file mode 100644 index 0000000..b382eeb --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_clusters_of_versions_earlier_than_mrs_3.x.rst @@ -0,0 +1,142 @@ +:original_name: mrs_01_1594.html + +.. _mrs_01_1594: + +Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x +======================================================================== + +Scenario +-------- + +To use Flume to collect logs, you must install the Flume client on a log host. You can create an ECS and install the Flume client on it. + +This section applies to MRS 3.\ *x* or earlier clusters. + +Prerequisites +------------- + +- A streaming cluster with the Flume component has been created. +- The log host is in the same VPC and subnet with the MRS cluster. +- You have obtained the username and password for logging in to the log host. + +Procedure +--------- + +#. Create an ECS that meets the requirements. + +#. Go to the cluster details page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services**. + - For MRS 1.9.2 or later, click the cluster name on the MRS console and choose **Components**. + +#. .. _mrs_01_1594__li1514145518420: + + Click **Download Client**. + + a. In **Client Type**, select **All client files**. + + b. In **Download to**, select **Remote host**. + + c. Set **Host IP Address** to the IP address of the ECS, **Host Port** to **22**, and **Save Path** to **/home/linux**. + + - If the default port **22** for logging in to an ECS through SSH has been changed, set **Host Port** to a new port. + - The value of **Save Path** contains a maximum of 256 characters. + + d. Set **Login User** to **root**. + + If another user is used, ensure that the user has permissions to read, write, and execute the save path. + + e. In **SSH Private Key**, select and upload the key file used for creating the cluster. + + f. Click **OK** to generate a client file. + + If the following information is displayed, the client package is saved. + + .. code-block:: text + + Client files downloaded to the remote host successfully. + + If the following information is displayed, check the username, password, and security group configurations of the remote host. Ensure that the username and password are correct and an inbound rule of the SSH (22) port has been added to the security group of the remote host. And then, go to :ref:`3 ` to download the client again. + + .. code-block:: text + + Failed to connect to the server. Please check the network connection or parameter settings. + +#. Choose **Flume** > **Instance**. Query the **Business IP Address** of any Flume instance and any two MonitorServer instances. + +#. Log in to the ECS using VNC. See section "Login Using VNC" in the *Elastic Cloud Service User Guide* (**Instances** > **Logging In to a Linux ECS** > **Login Using VNC**. + + Log in to the ECS using an SSH key by referring to `Login Using an SSH Key `__ and set the password. Then log in to the ECS using VNC. + +#. 
On the ECS, switch to user **root** and copy the installation package to the **/opt** directory. + + **sudo su - root** + + **cp /home/linux/MRS_Flume_Client.tar /opt** + +#. Run the following command in the **/opt** directory to decompress the package and obtain the verification file and the configuration package of the client: + + **tar -xvf MRS_Flume_Client.tar** + +#. Run the following command to verify the configuration package of the client: + + **sha256sum -c MRS_Flume_ClientConfig.tar.sha256** + + If the following information is displayed, the file package is successfully verified: + + .. code-block:: + + MRS_Flume_ClientConfig.tar: OK + +#. Run the following command to decompress **MRS_Flume_ClientConfig.tar**: + + **tar -xvf MRS_Flume_ClientConfig.tar** + +#. Run the following command to install the client running environment to a new directory, for example, **/opt/Flumeenv**. A directory is automatically generated during the client installation. + + **sh /opt/MRS_Flume_ClientConfig/install.sh /opt/Flumeenv** + + If the following information is displayed, the client running environment is successfully installed: + + .. code-block:: + + Components client installation is complete. + +#. Run the following command to configure environment variables: + + **source /opt/Flumeenv/bigdata_env** + +#. Run the following commands to decompress the Flume client package: + + **cd /opt/MRS_Flume_ClientConfig/Flume** + + **tar -xvf FusionInsight-Flume-1.6.0.tar.gz** + +#. Run the following command to check whether the password of the current user has expired: + + **chage -l root** + + If the value of **Password expires** is earlier than the current time, the password has expired. Run the **chage -M -1 root** command to validate the password. + +#. Run the following command to install the Flume client to a new directory, for example, **/opt/FlumeClient**. A directory is automatically generated during the client installation. + + **sh /opt/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient -f** *service IP address of the MonitorServer instance* **-c** *path of the Flume configuration file* **-l /var/log/ -e** *service IP address of Flume* **-n** *name of the Flume client* + + The parameters are described as follows: + + - **-d**: indicates the installation path of the Flume client. + - (Optional) **-f**: indicates the service IP addresses of the two MonitorServer instances, separated by a comma (,). If the IP addresses are not configured, the Flume client will not send alarm information to MonitorServer, and the client information will not be displayed on MRS Manager. + - (Optional) **-c**: indicates the **properties.properties** configuration file that the Flume client loads after installation. If this parameter is not specified, the **fusioninsight-flume-1.6.0/conf/properties.properties** file in the client installation directory is used by default. The configuration file of the client is empty. You can modify the configuration file as required and the Flume client will load it automatically. + - (Optional) **-l**: indicates the log directory. The default value is **/var/log/Bigdata**. + - (Optional) **-e**: indicates the service IP address of the Flume instance. It is used to receive the monitoring indicators reported by the client. + - (Optional) **-n**: indicates the name of the Flume client. + - IBM JDK does not support **-Xloggc**. You must change **-Xloggc** to **-Xverbosegclog** in **flume/conf/flume-env.sh**. For 32-bit JDK, the value of **-Xmx** must not exceed 3.25 GB. 
+ - In **flume/conf/flume-env.sh**, the default value of **-Xmx** is 4 GB. If the client memory is too small, you can change it to 512 MB or even 1 GB. + + For example, run **sh install.sh -d /opt/FlumeClient**. + + If the following information is displayed, the client is successfully installed: + + .. code-block:: + + install flume client successfully. diff --git a/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst new file mode 100644 index 0000000..b3514ed --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/installing_the_flume_client/installing_the_flume_client_on_mrs_3.x_or_later_clusters.rst @@ -0,0 +1,92 @@ +:original_name: mrs_01_1595.html + +.. _mrs_01_1595: + +Installing the Flume Client on MRS 3.\ *x* or Later Clusters +============================================================ + +Scenario +-------- + +To use Flume to collect logs, you must install the Flume client on a log host. You can create an ECS and install the Flume client on it. + +This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +- A cluster with the Flume component has been created. +- The log host is in the same VPC and subnet with the MRS cluster. +- You have obtained the username and password for logging in to the log host. +- The installation directory is automatically created if it does not exist. If it exists, the directory must be left blank. The directory path cannot contain any space. + +Procedure +--------- + +#. Obtain the software package. + + Log in to the FusionInsight Manager. Choose **Cluster** > *Name of the target cluster* > **Services** > **Flume**. On the Flume service page that is displayed, choose **More** > **Download Client** in the upper right corner and set **Select Client Type** to **Complete Client** to download the Flume service client file. + + The file name of the client is **FusionInsight_Cluster\_**\ <*Cluster ID*>\ **\_Flume_Client.tar**. This section takes the client file **FusionInsight_Cluster_1_Flume_Client.tar** as an example. + +#. Upload the software package. + + Upload the software package to a directory, for example, **/opt/client** on the node where the Flume service client will be installed as user **user**. + + .. note:: + + **user** is the user who installs and runs the Flume client. + +#. Decompress the software package. + + Log in to the node where the Flume service client is to be installed as user **user**. Go to the directory where the installation package is installed, for example, **/opt/client**, and run the following command to decompress the installation package to the current directory: + + **cd /opt/client** + + **tar -xvf FusionInsight\_Cluster_1_Flume_Client.tar** + +#. Verify the software package. + + Run the **sha256sum -c** command to verify the decompressed file. If **OK** is returned, the verification is successful. Example: + + **sha256sum -c FusionInsight\_Cluster_1_Flume_ClientConfig.tar.sha256** + + .. code-block:: + + FusionInsight_Cluster_1_Flume_ClientConfig.tar: OK + +#. Decompress the package. + + **tar -xvf FusionInsight\_Cluster_1_Flume_ClientConfig.tar** + +#. 
Run the following command in the Flume client installation directory to install the client to a specified directory (for example, **opt/FlumeClient**): After the client is installed successfully, the installation is complete. + + **cd /opt/client/FusionInsight\_Cluster_1_Flume_ClientConfig/Flume/FlumeClient** + + **./install.sh -d /**\ *opt/FlumeClient* **-f** *MonitorServerService IP address or host name of the role* **-c** *User service configuration filePath for storing properties.properties* **-s** *CPU threshold* **-l /var/log/Bigdata -e** *FlumeServer service IP address or host name* **-n** *Flume* + + .. note:: + + - **-d**: Flume client installation path + + - (Optional) **-f**: IP addresses or host names of two MonitorServer roles. The IP addresses or host names are separated by commas (,). If this parameter is not configured, the Flume client does not send alarm information to MonitorServer and information about the client cannot be viewed on the FusionInsight Manager GUI. + + - (Optional) **-c**: Service configuration file, which needs to be generated by the user based on the service. For details about how to generate the file on the configuration tool page of the Flume server, see :ref:`Flume Service Configuration Guide `. Upload the file to any directory on the node where the client is to be installed. If this parameter is not specified during the installation, you can upload the generated service configuration file **properties.properties** to the **/opt/FlumeClient/fusioninsight-flume-1.9.0/conf** directory after the installation. + + - (Optional) **-s**: cgroup threshold. The value is an integer ranging from 1 to 100 x *N*. *N* indicates the number of CPU cores. The default threshold is **-1**, indicating that the processes added to the cgroup are not restricted by the CPU usage. + + - (Optional) **-l**: Log path. The default value is **/var/log/Bigdata**. The user **user** must have the write permission on the directory. When the client is installed for the first time, a subdirectory named **flume-client** is generated. After the installation, subdirectories named **flume-client-**\ *n* will be generated in sequence. The letter *n* indicates a sequence number, which starts from 1 in ascending order. In the **/conf/** directory of the Flume client installation directory, modify the **ENV_VARS** file and search for the **FLUME_LOG_DIR** attribute to view the client log path. + + - (Optional) **-e**: Service IP address or host name of FlumeServer, which is used to receive statistics for the monitoring indicator reported by the client. + + - (Optional) **-n**: Name of the Flume client. You can choose **Cluster** > *Name of the desired cluster* > **Service** > **Flume** > **Flume Management** on FusionInsight Manager to view the client name on the corresponding node. + + - If the following error message is displayed, run the **export JAVA_HOME=\ JDK path** command. + + .. code-block:: + + JAVA_HOME is null in current user,please install the JDK and set the JAVA_HOME + + - IBM JDK does not support **-Xloggc**. You must change **-Xloggc** to **-Xverbosegclog** in **flume/conf/flume-env.sh**. For 32-bit JDK, the value of **-Xmx** must not exceed 3.25 GB. + + - When installing a cross-platform client in a cluster, go to the **/opt/client/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FusionInsight-Flume-1.9.0.tar.gz** directory to install the Flume client. 
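+
+   For example, assuming placeholder MonitorServer IP addresses **192.168.0.11** and **192.168.0.12**, a FlumeServer IP address **192.168.0.13**, and a service configuration file uploaded to **/opt/conf/properties.properties**, the installation command might look like the following sketch (replace every value with the ones from your own cluster):
+
+   .. code-block::
+
+      cd /opt/client/FusionInsight_Cluster_1_Flume_ClientConfig/Flume/FlumeClient
+      ./install.sh -d /opt/FlumeClient \
+                   -f 192.168.0.11,192.168.0.12 \
+                   -c /opt/conf/properties.properties \
+                   -l /var/log/Bigdata \
+                   -e 192.168.0.13 \
+                   -n FlumeClient-1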
diff --git a/doc/component-operation-guide/source/using_flume/introduction_to_flume_logs.rst b/doc/component-operation-guide/source/using_flume/introduction_to_flume_logs.rst new file mode 100644 index 0000000..30b193e --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/introduction_to_flume_logs.rst @@ -0,0 +1,94 @@ +:original_name: mrs_01_1081.html + +.. _mrs_01_1081: + +Introduction to Flume Logs +========================== + +Log Description +--------------- + +**Log path**: The default path of Flume log files is **/var/log/Bigdata/**\ *Role name*. + +- FlumeServer: **/var/log/Bigdata/flume/flume** +- FlumeClient: **/var/log/Bigdata/flume-client-n/flume** +- MonitorServer: **/var/log/Bigdata/flume/monitor** + +**Log archive rule**: The automatic Flume log compression function is enabled. By default, when the size of logs exceeds 50 MB , logs are automatically compressed into a log file named in the following format: *-.[ID]*\ **.log.zip**. A maximum of 20 latest compressed files are reserved. The number of compressed files can be configured on the Manager portal. + +.. table:: **Table 1** Flume log list + + +----------+-------------------------------------+---------------------------------------------------------------------+ + | Type | Name | Description | + +==========+=====================================+=====================================================================+ + | Run logs | /flume/flumeServer.log | Log file that records FlumeServer running environment information. | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /flume/install.log | FlumeServer installation log file | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /flume/flumeServer-gc.log.\ ** | GC log file of the FlumeServer process | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /flume/prestartDvietail.log | Work log file before the FlumeServer startup | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /flume/startDetail.log | Startup log file of the Flume process | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /flume/stopDetail.log | Shutdown log file of the Flume process | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /monitor/monitorServer.log | Log file that records MonitorServer running environment information | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /monitor/startDetail.log | Startup log file of the MonitorServer process | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | /monitor/stopDetail.log | Shutdown log file of the MonitorServer process | + +----------+-------------------------------------+---------------------------------------------------------------------+ + | | function.log | External function invoking log file | + +----------+-------------------------------------+---------------------------------------------------------------------+ + +Log Level +--------- + +:ref:`Table 2 ` describes the log levels supported by Flume. 
+ +Levels of run logs are FATAL, ERROR, WARN, INFO, and DEBUG from the highest to the lowest priority. Run logs of equal or higher levels are recorded. The higher the specified log level, the fewer the logs recorded. + +.. _mrs_01_1081__tc09b739e3eb34797a6da936a37654e97: + +.. table:: **Table 2** Log level + + +---------+-------+------------------------------------------------------------------------------------------+ + | Type | Level | Description | + +=========+=======+==========================================================================================+ + | Run log | FATAL | Logs of this level record critical error information about system running. | + +---------+-------+------------------------------------------------------------------------------------------+ + | | ERROR | Logs of this level record error information about system running. | + +---------+-------+------------------------------------------------------------------------------------------+ + | | WARN | Logs of this level record exception information about the current event processing. | + +---------+-------+------------------------------------------------------------------------------------------+ + | | INFO | Logs of this level record normal running status information about the system and events. | + +---------+-------+------------------------------------------------------------------------------------------+ + | | DEBUG | Logs of this level record the system information and system debugging information. | + +---------+-------+------------------------------------------------------------------------------------------+ + +To modify log levels, perform the following operations: + +#. Go to the **All Configurations** page of Flume by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the menu bar on the left, select the log menu of the target role. +#. Select a desired log level. +#. Save the configuration. In the displayed dialog box, click **OK** to make the configurations take effect. + +.. note:: + + The configurations take effect immediately without the need to restart the service. + +Log Format +---------- + +The following table lists the Flume log formats. + +.. table:: **Table 3** Log format + + +----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Format | Example | + +==========+========================================================================================================================================================+==================================================================================================================================================+ + | Run logs | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2014-12-12 11:54:57,316 \| INFO \| [main] \| log4j dynamic load is start. 
\| org.apache.flume.tools.LogDynamicLoad.start(LogDynamicLoad.java:59) | + +----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ + | | <*yyyy-MM-dd HH:mm:ss,SSS*><*Username*><*User IP*><*Time*><*Operation*><*Resource*><*Result*><*Detail>* | 2014-12-12 23:04:16,572 \| INFO \| [SinkRunner-PollingRunner-DefaultSinkProcessor] \| SRCIP=null OPERATION=close | + +----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/configuring_non-encrypted_transmission.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/configuring_non-encrypted_transmission.rst new file mode 100644 index 0000000..29b3dcb --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/configuring_non-encrypted_transmission.rst @@ -0,0 +1,108 @@ +:original_name: mrs_01_1060.html + +.. _mrs_01_1060: + +Configuring Non-encrypted Transmission +====================================== + +Scenario +-------- + +This section describes how to configure Flume server and client parameters after the cluster and the Flume service are installed to ensure proper running of the service. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. + +Prerequisites +------------- + +- The cluster and Flume service have been installed. +- The network environment of the cluster is secure. + +Procedure +--------- + +#. Configure the client parameters of the Flume role. + + a. Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + #. Log in to FusionInsight Manager. Choose **Cluster** > **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **client**. Select and drag the source, channel, and sink to be used to the GUI on the right, and connect them. + + For example, use SpoolDir Source, File Channel, and Avro Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If the client parameters of the Flume role have been configured, you can obtain the existing client parameter configuration file from *client installation directory*\ **/fusioninsight-flume-1.9.0/conf/properties.properties** to ensure that the configuration is in concordance with the previous. Log in to FusionInsight Manager, choose **Cluster** > **Services** > **Flume** > **Configuration** > **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. 
Otherwise, the response time may be very long. + + .. _mrs_01_1060__table1953421416017: + + .. table:: **Table 1** Parameters to be modified for the Flume role client + + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Example Value | + +=======================+===================================================================================================================+=======================+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | false | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the function is not enabled. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-----------------------+ + + #. Click **Export** to save the **properties.properties** configuration file to the local server. + + b. Upload the **properties.properties** file to **flume/conf/** under the installation directory of the Flume client. + +#. Configure the server parameters of the Flume role and upload the configuration file to the cluster. + + a. Use the Flume configuration tool on the FusionInsight Manager portal to configure the server parameters and generate the configuration file. + + #. Log in to FusionInsight Manager. Choose **Cluster** > **Services** > **Flume** > **Configuration Tool**. + + #. Set **Agent Name** to **server**. Select and drag the source, channel, and sink to be used to the GUI on the right, and connect them. + + For example, use Avro Source, File Channel, and HDFS Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 2 ` based on the actual environment. + + .. note:: + + - If the server parameters of the Flume role have been configured, you can choose **Cluster** > **Services** > **Flume** > **Instance** on FusionInsight Manager. Then select the corresponding Flume role instance and click the **Download** button behind the **flume.config.file** parameter on the **Instance Configurations** page to obtain the existing server parameter configuration file. Choose **Cluster** > **Service** > **Flume** > **Configurations** > **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + - A unique checkpoint directory needs to be configured for each File Channel. + + .. _mrs_01_1060__table1324711382017: + + .. table:: **Table 2** Parameters to be modified for the Flume role server + + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Example Value | + +=======================+===================================================================================================================+=======================+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) 
| false | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the function is not enabled. | | + +-----------------------+-------------------------------------------------------------------------------------------------------------------+-----------------------+ + + #. Click **Export** to save the **properties.properties** configuration file to the local server. + + b. Log in to FusionInsight Manager and choose **Cluster** > **Services** > **Flume**. On the **Instances** tab page, click **Flume**. + c. Select the Flume role of the node where the configuration file is to be uploaded, choose **Instance Configurations** > **Import** beside the **flume.config.file**, and select the **properties.properties** file. + + .. note:: + + - An independent server configuration file can be uploaded to each Flume instance. + - This step is required for updating the configuration file. Modifying the configuration file on the background is an improper operation because the modification will be overwritten after configuration synchronization. + + d. Click **Save**, and then click **OK**. + e. Click **Finish**. diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/index.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/index.rst new file mode 100644 index 0000000..ef72c8c --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/index.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_1059.html + +.. _mrs_01_1059: + +Non-Encrypted Transmission +========================== + +- :ref:`Configuring Non-encrypted Transmission ` +- :ref:`Typical Scenario: Collecting Local Static Logs and Uploading Them to Kafka ` +- :ref:`Typical Scenario: Collecting Local Static Logs and Uploading Them to HDFS ` +- :ref:`Typical Scenario: Collecting Local Dynamic Logs and Uploading Them to HDFS ` +- :ref:`Typical Scenario: Collecting Logs from Kafka and Uploading Them to HDFS ` +- :ref:`Typical Scenario: Collecting Logs from Kafka and Uploading Them to HDFS Through the Flume Client ` +- :ref:`Typical Scenario: Collecting Local Static Logs and Uploading Them to HBase ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + configuring_non-encrypted_transmission + typical_scenario_collecting_local_static_logs_and_uploading_them_to_kafka + typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs + typical_scenario_collecting_local_dynamic_logs_and_uploading_them_to_hdfs + typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs + typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs_through_the_flume_client + typical_scenario_collecting_local_static_logs_and_uploading_them_to_hbase diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_dynamic_logs_and_uploading_them_to_hdfs.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_dynamic_logs_and_uploading_them_to_hdfs.rst new file mode 100644 index 0000000..74d194a --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_dynamic_logs_and_uploading_them_to_hdfs.rst @@ -0,0 +1,95 @@ +:original_name: mrs_01_1064.html + +.. 
_mrs_01_1064: + +Typical Scenario: Collecting Local Dynamic Logs and Uploading Them to HDFS +========================================================================== + +Scenario +-------- + +This section describes how to use the Flume client to collect dynamic logs from a local host and save them to the **/flume/test** directory on HDFS. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. The configuration applies to scenarios where only the Flume is configured, for example, Taildir Source+Memory Channel+HDFS Sink. + +Prerequisites +------------- + +- The cluster has been installed, including the HDFS and Flume services. +- The Flume client has been installed. For details, see `Installing the Flume Client `__. +- The network environment of the cluster is secure. +- You have created user **flume_hdfs** and authorized the HDFS directory and data to be operated during log verification. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System > User** and choose **More > Download Authentication Credential** to download the Kerberos certificate file of user **flume_hdfs** and save it to the local host. + +#. Set Flume parameters. + + Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + a. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab. + + b. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + Use Taildir Source, Memory Channel, and HDFS Sink. + + c. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.propretites** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1064__table275562484: + + .. 
table:: **Table 1** Parameters to be modified for the Flume role client + + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +========================+==============================================================================================================================================================================================================================================================================================================+============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank. | test | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | filegroups | Specifies the file group list name. This parameter cannot be left blank. The value contains the following two parts: | ``-`` | + | | | | + | | - **Name**: name of the file group list. | | + | | - **filegroups**: absolute path of dynamic log files. | | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | positionFile | Specifies the location where the collected file information (file name and location from which the file collected) is saved. This parameter cannot be left blank. The file does not need to be created manually, but the Flume running user needs to have the write permission on its upper-level directory. 
| /home/omm/flume/positionfile | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch. | 61200 | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | Specifies the HDFS data write directory. This parameter cannot be left blank. | hdfs://hacluster/flume/test | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.filePrefix | Specifies the file name prefix after data is written to HDFS. | TMP\_ | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | Specifies the maximum number of events that can be written to HDFS once. | 61200 | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. 
| flume_hdfs | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | Specifies the keytab file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hdfs**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | Specifies whether to use the local time. Possible values are **true** and **false**. | true | + +------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. .. _mrs_01_1064__l78938a30f82d4a5283b7c4aaa1bb79b1: + + Click **Export** to save the **properties.properties** configuration file to the local. + +#. Upload the configuration file. + + Upload the file exported in :ref:`2.d ` to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory of the cluster. + +4. Verify log transmission. + + a. Log in to FusionInsight Manager as a user who has the management permission on HDFS. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **NameNode(**\ *Node name*\ **,Active)** link next to **NameNode WebUI** to go to the HDFS web UI. On the displayed page, choose **Utilities** > **Browse the file system**. + b. Check whether the data is generated in the **/flume/test** directory on the HDFS. 
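+
+For reference, the client configuration described in this section (Taildir Source + Memory Channel + HDFS Sink) is conceptually equivalent to a **properties.properties** file similar to the following sketch. The agent, source, channel, and sink names, the monitored log path, and the channel capacity values are illustrative assumptions; the remaining values mirror the example values in Table 1:
+
+.. code-block::
+
+   client.sources = taildir_src
+   client.channels = mem_ch
+   client.sinks = hdfs_sink
+
+   # Taildir Source: collects dynamically appended log files (log path is a placeholder)
+   client.sources.taildir_src.type = TAILDIR
+   client.sources.taildir_src.filegroups = f1
+   client.sources.taildir_src.filegroups.f1 = /var/log/test/.*log.*
+   client.sources.taildir_src.positionFile = /home/omm/flume/positionfile
+   client.sources.taildir_src.batchSize = 61200
+   client.sources.taildir_src.channels = mem_ch
+
+   # Memory Channel: transactionCapacity must be no smaller than the batch sizes used by the source and sink
+   client.channels.mem_ch.type = memory
+   client.channels.mem_ch.capacity = 100000
+   client.channels.mem_ch.transactionCapacity = 61200
+
+   # HDFS Sink: writes to the /flume/test directory in HDFS (Kerberos settings apply to security clusters only)
+   client.sinks.hdfs_sink.type = hdfs
+   client.sinks.hdfs_sink.hdfs.path = hdfs://hacluster/flume/test
+   client.sinks.hdfs_sink.hdfs.filePrefix = TMP_
+   client.sinks.hdfs_sink.hdfs.batchSize = 61200
+   client.sinks.hdfs_sink.hdfs.kerberosPrincipal = flume_hdfs
+   client.sinks.hdfs_sink.hdfs.kerberosKeytab = /opt/test/conf/user.keytab
+   client.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
+   client.sinks.hdfs_sink.channel = mem_ch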
diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hbase.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hbase.rst new file mode 100644 index 0000000..2668445 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hbase.rst @@ -0,0 +1,177 @@ +:original_name: mrs_01_1067.html + +.. _mrs_01_1067: + +Typical Scenario: Collecting Local Static Logs and Uploading Them to HBase +========================================================================== + +Scenario +-------- + +This section describes how to use the Flume client to collect static logs from a local host and save them to the **flume_test** HBase table. In this scenario, multi-level agents are cascaded. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. The configuration applies to scenarios where only the server is configured, for example, Spooldir Source+File Channel+HBase Sink. + +Prerequisites +------------- + +- The cluster has been installed, including the HBase and Flume services. +- The Flume client has been installed. For details, see `Installing the Flume Client `__. +- The network environment of the cluster is secure. +- An HBase table has been created by running the **create 'flume_test', 'cf'** command. +- The system administrator has understood service requirements and prepared HBase administrator **flume_hbase**. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System > User** and choose **More > Download Authentication Credential** to download the Kerberos certificate file of user **flume_hbase** and save it to the local host. +#. Configure the client parameters of the Flume role. + + a. Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab. + + #. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + Use SpoolDir Source, File Channel, and Avro Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.properties** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1067__table15735429105511: + + .. 
table:: **Table 1** Parameters to be modified for the Flume role client + + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | Parameter | Description | Example Value | + +=======================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================+============================================+ + | Name | The value must be unique and cannot be left blank. | test | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | spoolDir | Specifies the directory where the file to be collected resides. This parameter cannot be left blank. The directory needs to exist and have the write, read, and execute permissions on the flume running user. | /srv/BigData/hadoop/data1/zb | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | trackerDir | Specifies the path for storing the metadata of files collected by Flume. | /srv/BigData/hadoop/data1/tracker | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch (number of data pieces). A larger value indicates higher performance and lower timeliness. 
| 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | dataDirs | Specifies the directory for storing buffer data. The run directory is used by default. Configuring multiple directories on disks can improve transmission efficiency. Use commas (,) to separate multiple directories. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/data** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/data | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | checkpointDir | Specifies the directory for storing the checkpoint information, which is under the run directory by default. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/checkpoint** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/checkpoint | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | transactionCapacity | Specifies the transaction size, that is, the number of events in a transaction that can be processed by the current Channel. The size cannot be smaller than the batchSize of Source. Setting the same size as batchSize is recommended. | 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | hostname | Specifies the name or IP address of the host whose data is to be sent. This parameter cannot be left blank. Name or IP address must be configured to be the name or IP address that the Avro source associated with it. 
| 192.168.108.11 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | port | Specifies the port that sends the data. This parameter cannot be left blank. It must be consistent with the port that is monitored by the connected Avro Source. | 21154 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | false | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+ + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + b. Upload the **properties.properties** file to **flume/conf/** under the installation directory of the Flume client. + +#. Configure the server parameters of the Flume role and upload the configuration file to the cluster. + + a. Use the Flume configuration tool on the FusionInsight Manager portal to configure the server parameters and generate the configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab. + + #. Set **Agent Name** to **server**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use Avro Source, File Channel, and HBase Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 2 ` based on the actual environment. + + .. note:: + + - If the server parameters of the Flume role have been configured, you can choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Instance** on FusionInsight Manager. 
Then select the corresponding Flume role instance and click the **Download** button behind the **flume.config.file** parameter on the **Instance Configurations** page to obtain the existing server parameter configuration file. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Configuration Tool** > **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + - A unique checkpoint directory needs to be configured for each File Channel. + + .. _mrs_01_1067__table77819014563: + + .. table:: **Table 2** Parameters to be modified for the Flume role server + + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=======================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================+=============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank. | test | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | bind | Specifies the IP address to which Avro Source is bound. This parameter cannot be left blank. It must be configured as the IP address that the server configuration file will upload. 
| 192.168.108.11 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | port | Specifies the ID of the port that the Avro Source monitors. This parameter cannot be left blank. It must be configured as an unused port. | 21154 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ssl | Specifies whether to enable the SSL authentication. (You are advised to enable this function to ensure security.) | false | + | | | | + | | Only Sources of the Avro type have this configuration item. | | + | | | | + | | - **true** indicates that the function is enabled. | | + | | - **false** indicates that the client authentication function is not enabled. | | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDirs | Specifies the directory for storing buffer data. The run directory is used by default. Configuring multiple directories on disks can improve transmission efficiency. Use commas (,) to separate multiple directories. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/data** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. 
| /srv/BigData/hadoop/data1/flumeserver/data | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointDir | Specifies the directory for storing the checkpoint information, which is under the run directory by default. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/checkpoint** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flumeserver/checkpoint | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | Specifies the transaction size, that is, the number of events in a transaction that can be processed by the current Channel. The size cannot be smaller than the batchSize of Source. Setting the same size as batchSize is recommended. | 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | table | Specifies the HBase table name. This parameter cannot be left blank. 
| flume_test | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | columnFamily | Specifies the HBase column family name. This parameter cannot be left blank. | cf | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | Specifies the maximum number of events written to HBase by Flume in a batch. | 61200 | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. | flume_hbase | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kerberosKeytab | Specifies the file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. 
note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hbase**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. | + +-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + b. Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume**. On the displayed page, click the **Flume** role on the **Instance** tab page. + c. Select the Flume role of the node where the configuration file is to be uploaded, choose **Instance Configurations** > **Import** beside the **flume.config.file**, and select the **properties.properties** file. + + .. note:: + + - An independent server configuration file can be uploaded to each Flume instance. + - This step is required for updating the configuration file. Modifying the configuration file on the background is an improper operation because the modification will be overwritten after configuration synchronization. + + d. Click **Save**, and then click **OK**. + e. Click **Finish**. + +4. Verify log transmission. + + a. Go to the directory where the HBase client is installed. + + **cd /**\ *Client installation directory*\ **/ HBase/hbase** + + **kinit flume_hbase** (Enter the password.) + + b. Run the **hbase shell** command to access the HBase client. + + c. Run the **scan 'flume_test'** statement. Logs are written in the HBase column family by line. + + .. code-block:: + + hbase(main):001:0> scan 'flume_test' + ROW COLUMN+CELL + 2017-09-18 16:05:36,394 INFO [hconnection-0x415a3f6a-shared--pool2-t1] ipc.AbstractRpcClient: RPC Server Kerberos principal name for service=ClientService is hbase/hadoop.@ + default4021ff4a-9339-4151-a4d0-00f20807e76d column=cf:pCol, timestamp=1505721909388, value=Welcome to flume + incRow column=cf:iCol, timestamp=1505721909461, value=\x00\x00\x00\x00\x00\x00\x00\x01 + 2 row(s) in 0.3660 seconds diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst new file mode 100644 index 0000000..614be9e --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_hdfs.rst @@ -0,0 +1,92 @@ +:original_name: mrs_01_1063.html + +.. 
_mrs_01_1063: + +Typical Scenario: Collecting Local Static Logs and Uploading Them to HDFS +========================================================================= + +Scenario +-------- + +This section describes how to use the Flume client to collect static logs from a local host and save them to the **/flume/test** directory on HDFS. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. The configuration applies to scenarios where only the Flume client is configured, for example, Spooldir Source+Memory Channel+HDFS Sink. + +Prerequisites +------------- + +- The cluster has been installed, including the HDFS and Flume services. +- The Flume client has been installed. For details, see `Installing the Flume Client `__. +- The network environment of the cluster is secure. +- User **flume_hdfs** has been created and granted permissions on the HDFS directory and data used for log verification. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System** > **Permission** > **User**, select user **flume_hdfs**, and choose **More** > **Download Authentication Credential** to download the Kerberos certificate file of user **flume_hdfs** and save it to the local host. + +#. Set Flume parameters. + + Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + a. Log in to FusionInsight Manager. Choose **Cluster** > **Services** > **Flume** > **Configuration Tool**. + + b. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + Use SpoolDir Source, Memory Channel, and HDFS Sink. + + c. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.properties** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1063__table10698142517481: + + .. 
table:: **Table 1** Parameters to be modified for the Flume role client + + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +========================+================================================================================================================================================================================================================+============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank. | test | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | spoolDir | Specifies the directory where the file to be collected resides. This parameter cannot be left blank. The directory needs to exist and have the write, read, and execute permissions on the flume running user. | /srv/BigData/hadoop/data1/zb | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | trackerDir | Specifies the path for storing the metadata of files collected by Flume. | /srv/BigData/hadoop/data1/tracker | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch. 
| 61200 | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | Specifies the HDFS data write directory. This parameter cannot be left blank. | hdfs://hacluster/flume/test | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.filePrefix | Specifies the file name prefix after data is written to HDFS. | TMP\_ | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | Specifies the maximum number of events that can be written to HDFS once. | 61200 | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. | flume_hdfs | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | Specifies the keytab file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hdfs**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. 
| + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | Specifies whether to use the local time. Possible values are **true** and **false**. | true | + +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. .. _mrs_01_1063__ld87a5f43900a41ad8cda390510028ae7: + + Click **Export** to save the **properties.properties** configuration file to the local. + +#. Upload the configuration file. + + Upload the file exported in :ref:`2.d ` to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory of the cluster. + +4. Verify log transmission. + + a. Log in to FusionInsight Manager as a user who has the management permission on HDFS. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **NameNode(**\ *Node name*\ **,Active)** link next to **NameNode WebUI** to go to the HDFS web UI. On the displayed page, choose **Utilities** > **Browse the file system**. + b. Check whether the data is generated in the **/flume/test** directory on the HDFS. diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_kafka.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_kafka.rst new file mode 100644 index 0000000..a7b5c50 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_local_static_logs_and_uploading_them_to_kafka.rst @@ -0,0 +1,92 @@ +:original_name: mrs_01_1061.html + +.. _mrs_01_1061: + +Typical Scenario: Collecting Local Static Logs and Uploading Them to Kafka +========================================================================== + +Scenario +-------- + +This section describes how to use the Flume client to collect static logs from a local host and save them to the topic list (test1) of Kafka. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. The configuration applies to scenarios where only the Flume is configured, for example, Spooldir Source+Memory Channel+Kafka Sink. + +Prerequisites +------------- + +- The cluster has been installed, including the Kafka and Flume services. +- The Flume client has been installed. 
For details, see `Installing the Flume Client `__. +- The network environment of the cluster is secure. +- The system administrator has understood service requirements and prepared Kafka administrator **flume_kafka**. + +Procedure +--------- + +#. Set Flume parameters. + + Use the Flume configuration tool on Manager to configure the Flume role client parameters and generate a configuration file. + + a. Log in to FusionInsight Manager. Choose **Cluster** > **Services** > **Flume** > **Configuration Tool**. + + b. Set **Agent Name** to **client**. Select and drag the source, channel, and sink to be used to the GUI on the right, and connect them. + + Use SpoolDir Source, Memory Channel, and Kafka Sink. + + c. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.propretites** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1061__table1162101394616: + + .. table:: **Table 1** Parameters to be modified for the Flume role client + + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | Parameter | Description | Example Value | + +=========================+================================================================================================================================================================================================================+===================================+ + | Name | The value must be unique and cannot be left blank. | test | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | spoolDir | Specifies the directory where the file to be collected resides. This parameter cannot be left blank. The directory needs to exist and have the write, read, and execute permissions on the flume running user. | /srv/BigData/hadoop/data1/zb | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | trackerDir | Specifies the path for storing the metadata of files collected by Flume. | /srv/BigData/hadoop/data1/tracker | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch (number of data pieces). 
A larger value indicates higher performance and lower timeliness. | 61200 | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | kafka.topics | Specifies the list of subscribed Kafka topics, which are separated by commas (,). This parameter cannot be left blank. | test1 | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + | kafka.bootstrap.servers | Specifies the bootstrap IP address and port list of Kafka. The default value is all Kafkabrokers in the Kafka cluster. | 192.168.101.10:21007 | + +-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+ + + d. .. _mrs_01_1061__l14d98e844ee849a99592f46d8be65b86: + + Click **Export** to save the **properties.properties** configuration file to the local server. + +#. Upload the configuration file. + + Upload the file exported in :ref:`1.d ` to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory of the cluster. + +3. Verify log transmission. + + a. Log in to the Kafka client. + + **cd** *Kafka client installation directory*\ **/Kafka/kafka** + + **kinit flume_kafka** (Enter the password.) + + b. Read data from a Kafka topic. + + **bin/kafka-console-consumer.sh --topic** *topic name* **--bootstrap-server** *Kafka service IP address of the node where the role instance is located*\ **: 21007 --consumer.config config/consumer.properties --from-beginning** + + The system displays the contents of the file to be collected. + + .. code-block:: console + + [root@host1 kafka]# bin/kafka-console-consumer.sh --topic test1 --bootstrap-server 192.168.101.10:21007 --consumer.config config/consumer.properties --from-beginning + Welcome to flume diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs.rst new file mode 100644 index 0000000..31ccecd --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs.rst @@ -0,0 +1,94 @@ +:original_name: mrs_01_1065.html + +.. _mrs_01_1065: + +Typical Scenario: Collecting Logs from Kafka and Uploading Them to HDFS +======================================================================= + +Scenario +-------- + +This section describes how to use the Flume client to collect logs from the topic list (test1) of Kafka and save them to the **/flume/test** directory on HDFS. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. 
For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. The configuration applies to scenarios where only the Flume is configured, for example, Kafka Source+Memory Channel+HDFS Sink. + +Prerequisites +------------- + +- The cluster has been installed, including the HDFS, Kafka, and Flume services. +- The Flume client has been installed. For details, see `Installing the Flume Client `__. +- The network environment of the cluster is secure. +- You have created user **flume_hdfs** and authorized the HDFS directory and data to be operated during log verification. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System > User** and choose **More > Download Authentication Credential** to download the Kerberos certificate file of user **flume_hdfs** and save it to the local host. + +#. Configure the client parameters of the Flume role. + + Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + a. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab. + + b. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use Kafka Source, Memory Channel, and HDFS Sink. + + c. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.propretites** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1065__table2029895217498: + + .. table:: **Table 1** Parameters to be modified for the Flume role client + + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=========================+=================================================================================================================================================================================================================================================+============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank. 
| test | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics | Specifies the subscribed Kafka topic list, in which topics are separated by commas (,). This parameter cannot be left blank. | test1 | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id | Specifies the data group ID obtained from Kafka. This parameter cannot be left blank. | flume | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | Specifies the bootstrap IP address and port list of Kafka. The default value is all Kafka lists in a Kafka cluster. If Kafka has been installed in the cluster and its configurations have been synchronized, this parameter can be left blank. | 192.168.101.10:9092 | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch (number of data pieces). | 61200 | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | Specifies the HDFS data write directory. This parameter cannot be left blank. 
| hdfs://hacluster/flume/test | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.filePrefix | Specifies the file name prefix after data is written to HDFS. | TMP\_ | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | Specifies the maximum number of events that can be written to HDFS once. | 61200 | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. | flume_hdfs | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | Specifies the keytab file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hdfs**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | Specifies whether to use the local time. 
Possible values are **true** and **false**. | true | + +-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + d. .. _mrs_01_1065__l92b924df515f493daa8ec019ca9fcec4: + + Click **Export** to save the **properties.properties** configuration file to the local. + +#. Upload the configuration file. + + Upload the file exported in :ref:`2.d ` to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory of the cluster. + +4. Verify log transmission. + + a. Log in to FusionInsight Manager as a user who has the management permission on HDFS. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **NameNode(**\ *Node name*\ **,Active)** link next to **NameNode WebUI** to go to the HDFS web UI. On the displayed page, choose **Utilities** > **Browse the file system**. + b. Check whether the data is generated in the **/flume/test** directory on the HDFS. diff --git a/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs_through_the_flume_client.rst b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs_through_the_flume_client.rst new file mode 100644 index 0000000..7b9e3e0 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/non-encrypted_transmission/typical_scenario_collecting_logs_from_kafka_and_uploading_them_to_hdfs_through_the_flume_client.rst @@ -0,0 +1,123 @@ +:original_name: mrs_01_1066.html + +.. _mrs_01_1066: + +Typical Scenario: Collecting Logs from Kafka and Uploading Them to HDFS Through the Flume Client +================================================================================================ + +Scenario +-------- + +This section describes how to use the Flume client to collect logs from the topic list (test1) of the Kafka client and save them to the **/flume/test** directory on HDFS. + +This section applies to MRS 3.\ *x* or later clusters. + +.. note:: + + By default, the cluster network environment is secure and the SSL authentication is not enabled during the data transmission process. For details about how to use the encryption mode, see :ref:`Configuring the Encrypted Transmission `. + +Prerequisites +------------- + +- The cluster has been installed, including the HDFS, Kafka, and Flume services. +- The Flume client has been installed. For details, see `Installing the Flume Client `__. +- You have created user **flume_hdfs** and authorized the HDFS directory and data to be operated during log verification. +- The network environment of the cluster is secure. + +Procedure +--------- + +#. On FusionInsight Manager, choose **System > User** and choose **More > Download Authentication Credential** to download the Kerberos certificate file of user **flume_hdfs** and save it to the local host. +#. Configure the client parameters of the Flume role. + + a. 
Use the Flume configuration tool on FusionInsight Manager to configure the Flume role client parameters and generate a configuration file. + + #. Log in to FusionInsight Manager and choose **Cluster** > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab. + + #. Set **Agent Name** to **client**. Select the source, channel, and sink to be used, drag them to the GUI on the right, and connect them. + + For example, use Kafka Source, File Channel, and HDFS Sink. + + #. Double-click the source, channel, and sink. Set corresponding configuration parameters by referring to :ref:`Table 1 ` based on the actual environment. + + .. note:: + + - If you want to continue using the **properties.properties** file by modifying it, log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, choose **Flume**. On the displayed page, click the **Configuration Tool** tab, click **Import**, import the file, and modify the configuration items related to non-encrypted transmission. + - It is recommended that the numbers of Sources, Channels, and Sinks do not exceed 40 during configuration file import. Otherwise, the response time may be very long. + + .. _mrs_01_1066__table15799134418507: + + .. table:: **Table 1** Parameters to be modified for the Flume role client + + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Example Value | + +=========================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================+============================================================================================================================================================================================================================================+ + | Name | The value must be unique and cannot be left blank.
| test | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.topics | Specifies the subscribed Kafka topic list, in which topics are separated by commas (,). This parameter cannot be left blank. | test1 | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.consumer.group.id | Specifies the data group ID obtained from Kafka. This parameter cannot be left blank. | flume | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | kafka.bootstrap.servers | Specifies the bootstrap IP address and port list of Kafka. The default value is all Kafka lists in a Kafka cluster. If Kafka has been installed in the cluster and its configurations have been synchronized, this parameter can be left blank. | 192.168.101.10:21007 | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | batchSize | Specifies the number of events that Flume sends in a batch (number of data pieces). 
| 61200 | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | dataDirs | Specifies the directory for storing buffer data. The run directory is used by default. Configuring multiple directories on disks can improve transmission efficiency. Use commas (,) to separate multiple directories. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/data** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/data | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | checkpointDir | Specifies the directory for storing the checkpoint information, which is under the run directory by default. If the directory is inside the cluster, the **/srv/BigData/hadoop/dataX/flume/checkpoint** directory can be used. **dataX** ranges from data1 to dataN. If the directory is outside the cluster, it needs to be independently planned. | /srv/BigData/hadoop/data1/flume/checkpoint | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | transactionCapacity | Specifies the transaction size, that is, the number of events in a transaction that can be processed by the current Channel. The size cannot be smaller than the batchSize of Source. Setting the same size as batchSize is recommended. 
| 61200 | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.path | Specifies the HDFS data write directory. This parameter cannot be left blank. | hdfs://hacluster/flume/test | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.filePrefix | Specifies the file name prefix after data is written to HDFS. | TMP\_ | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.batchSize | Specifies the maximum number of events that can be written to HDFS once. | 61200 | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosPrincipal | Specifies the Kerberos authentication user, which is mandatory in security versions. This configuration is required only in security clusters. 
| flume_hdfs | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.kerberosKeytab | Specifies the keytab file path for Kerberos authentication, which is mandatory in security versions. This configuration is required only in security clusters. | /opt/test/conf/user.keytab | + | | | | + | | | .. note:: | + | | | | + | | | Obtain the **user.keytab** file from the Kerberos certificate file of the user **flume_hdfs**. In addition, ensure that the user who installs and runs the Flume client has the read and write permissions on the **user.keytab** file. | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hdfs.useLocalTimeStamp | Specifies whether to use the local time. Possible values are **true** and **false**. | true | + +-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + #. Click **Export** to save the **properties.properties** configuration file to the local. + + b. Upload the **properties.properties** file to **flume/conf/** under the installation directory of the Flume client. + + c. To connect the Flume client to the HDFS, you need to add the following configuration: + + #. Download the Kerberos certificate of account **flume_hdfs** and obtain the **krb5.conf** configuration file. Upload the configuration file to the **fusioninsight-flume-1.9.0/conf/** directory on the node where the client is installed. + + #. In **fusioninsight-flume-1.9.0/conf/**, create the **jaas.conf** configuration file. + + **vi jaas.conf** + + .. 
code-block:: + + KafkaClient { + com.sun.security.auth.module.Krb5LoginModule required + useKeyTab=true + keyTab="/opt/test/conf/user.keytab" + principal="flume_hdfs@" + useTicketCache=false + storeKey=true + debug=true; + }; + + Values of **keyTab** and **principal** vary depending on the actual situation. + + #. Obtain configuration files **core-site.xml** and **hdfs-site.xml** from **/opt/FusionInsight_Cluster\_\ \ \_Flume_ClientConfig/Flume/config** and upload them to **fusioninsight-flume-1.9.0/conf/**. + + d. Run the following command to restart the Flume process: + + **flume-manage.sh restart** + +3. Verify log transmission. + + a. Log in to FusionInsight Manager as a user who has the management permission on HDFS. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > **Services** > **HDFS**. On the page that is displayed, click the **NameNode(**\ *Node name*\ **,Active)** link next to **NameNode WebUI** to go to the HDFS web UI. On the displayed page, choose **Utilities** > **Browse the file system**. + b. Check whether the data is generated in the **/flume/test** directory on the HDFS. diff --git a/doc/component-operation-guide/source/using_flume/overview.rst b/doc/component-operation-guide/source/using_flume/overview.rst new file mode 100644 index 0000000..acfe9f3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/overview.rst @@ -0,0 +1,117 @@ +:original_name: mrs_01_0391.html + +.. _mrs_01_0391: + +Overview +======== + +Flume is a distributed, reliable, and highly available system for aggregating massive logs, which can efficiently collect, aggregate, and move massive log data from different data sources and store the data in a centralized data storage system. Various data senders can be customized in the system to collect data. Additionally, Flume provides simple data processing capabilities and writes data to data receivers (which are customizable). + +Flume consists of the client and server, both of which are FlumeAgents. The server corresponds to the FlumeServer instance and is directly deployed in a cluster. The client can be deployed inside or outside the cluster. The client-side and service-side FlumeAgents work independently and provide the same functions. + +The client-side FlumeAgent needs to be independently installed. Data can be directly imported to components such as HDFS and Kafka. Additionally, the client-side and service-side FlumeAgents can also work together to provide services. + +Process +------- + +The process for collecting logs using Flume is as follows: + +#. Installing the Flume client +#. Configuring the Flume server and client parameters +#. Collecting and querying logs using the Flume client +#. Stopping and uninstalling the Flume client + + +.. figure:: /_static/images/en-us_image_0000001296090416.png + :alt: **Figure 1** Log collection process + + **Figure 1** Log collection process + +Flume Client +------------ + +A Flume client consists of the source, channel, and sink. The source sends the data to the channel, and then the sink transmits the data from the channel to the external device. :ref:`Table 1 ` describes Flume modules. + +.. _mrs_01_0391__t3f29550548a749a4831f4ddfc95df002: + +..
table:: **Table 1** Module description + + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Name | Description | + +===================================+=============================================================================================================================================================================+ + | Source | A source receives or generates data and sends the data to one or multiple channels. The source can work in either data-driven or polling mode. | + | | | + | | Typical sources include: | + | | | + | | - Sources that are integrated with the system and receives data, such as Syslog and Netcat | + | | - Sources that automatically generate event data, such as Exec and SEQ | + | | - IPC sources that are used for communication between agents, such as Avro | + | | | + | | A Source must associate with at least one channel. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Channel | A channel is used to buffer data between a source and a sink. After the sink transmits the data to the next channel or the destination, the cache is deleted automatically. | + | | | + | | The persistency of the channels varies with the channel types: | + | | | + | | - Memory channel: non-persistency | + | | - File channel: persistency implemented based on write-ahead logging (WAL) | + | | - JDBC channel: persistency implemented based on the embedded database | + | | | + | | Channels support the transaction feature to ensure simple sequential operations. A channel can work with sources and sinks of any quantity. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Sink | Sink is responsible for sending data to the next hop or final destination and removing the data from the channel after successfully sending the data. | + | | | + | | Typical sinks include: | + | | | + | | - Sinks that send storage data to the final destination, such as HDFS and Kafka | + | | - Sinks that are consumed automatically, such as Null Sink | + | | - IPC sinks that are used for communication between agents, such as Avro | + | | | + | | A sink must associate with at least one channel. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +A Flume client can have multiple sources, channels, and sinks. A source can send data to multiple channels, and then multiple sinks send the data out of the client. + +Multiple Flume clients can be cascaded. That is, a sink can send data to the source of another client. + +Supplementary Information +------------------------- + +#. Flume provides the following reliability measures: + + - The transaction mechanism is implemented between sources and channels, and between channels and sinks. + + - The sink processor supports the failover and load balancing (load_balance) mechanisms. + + The following is an example of the load balancing (load_balance) configuration: + + .. 
code-block:: + + server.sinkgroups=g1 + server.sinkgroups.g1.sinks=k1 k2 + server.sinkgroups.g1.processor.type=load_balance + server.sinkgroups.g1.processor.backoff=true + server.sinkgroups.g1.processor.selector=random + +#. The following are precautions for the aggregation and cascading of multiple Flume clients: + + - Avro or Thrift protocol can be used for cascading. + - When the aggregation end contains multiple nodes, evenly distribute the clients to these nodes. Do not connect all the clients to a single node. + +#. The Flume client can contain multiple independent data flows. That is, multiple sources, channels, and sinks can be configured in the **properties.properties** configuration file. These components can be linked to form multiple flows. + + For example, to configure two data flows in a configuration, run the following commands: + + .. code-block:: + + server.sources = source1 source2 + server.sinks = sink1 sink2 + server.channels = channel1 channel2 + + #dataflow1 + server.sources.source1.channels = channel1 + server.sinks.sink1.channel = channel1 + + #dataflow2 + server.sources.source2.channels = channel2 + server.sinks.sink2.channel = channel2 diff --git a/doc/component-operation-guide/source/using_flume/secondary_development_guide_for_flume_third-party_plug-ins.rst b/doc/component-operation-guide/source/using_flume/secondary_development_guide_for_flume_third-party_plug-ins.rst new file mode 100644 index 0000000..d6ab21f --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/secondary_development_guide_for_flume_third-party_plug-ins.rst @@ -0,0 +1,52 @@ +:original_name: mrs_01_1083.html + +.. _mrs_01_1083: + +Secondary Development Guide for Flume Third-Party Plug-ins +========================================================== + +Scenario +-------- + +This section describes how to perform secondary development for third-party plug-ins. + +This section applies to MRS 3.\ *x* or later. + +Prerequisites +------------- + +- You have obtained the third-party JAR package. + +- You have installed Flume server or client. + +Procedure +--------- + +#. Compress the self-developed code into a JAR package. + +#. Create a directory for the plug-in. + + a. Access the **$FLUME_HOME/plugins.d** path and run the following command to create a directory: + + **mkdir thirdPlugin** + + **cd thirdPlugin** + + **mkdir lib libext** **native** + + The command output is displayed as follows: + + |image1| + + b. Place the third-party JAR package in the **$FLUME_HOME/plugins.d/thirdPlugin/lib** directory. If the JAR package depends on other JAR packages, place the depended JAR packages to the **$FLUME_HOME/ plugins.d/ thirdPlugin/libext** directory, and place the local library files in **$FLUME_HOME/ plugins.d/ thirdPlugin/native**. + +#. Configure the **properties.properties** file in **$FLUME_HOME/conf/**. + + For details about how to set parameters in the **properties.properties** file, see the parameter list in the **properties.properties** file in the corresponding typical scenario :ref:`Non-Encrypted Transmission ` and :ref:`Encrypted Transmission `. + + .. note:: + + - **$FLUME_HOME** indicates the Flume installation path. Set this parameter based on the site requirements (server or client) when configuring third-party plug-ins. + - **thirdPlugin** is the name of the third-party plugin. + +.. 
|image1| image:: /_static/images/en-us_image_0000001388527202.png diff --git a/doc/component-operation-guide/source/using_flume/stopping_or_uninstalling_the_flume_client.rst b/doc/component-operation-guide/source/using_flume/stopping_or_uninstalling_the_flume_client.rst new file mode 100644 index 0000000..7029045 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/stopping_or_uninstalling_the_flume_client.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_0394.html + +.. _mrs_01_0394: + +Stopping or Uninstalling the Flume Client +========================================= + +Scenario +-------- + +You can stop and start the Flume client or uninstall the Flume client when the Flume data ingestion channel is not required. + +Procedure +--------- + +- Stop the Flume client of the Flume role. + + Assume that the Flume client installation path is **/opt/FlumeClient**. Run the following command to stop the Flume client: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh stop** + + If the following information is displayed after the command execution, the Flume client is successfully stopped. + + .. code-block:: + + Stop Flume PID=120689 successful.. + + .. note:: + + The Flume client will be automatically restarted after being stopped. If you do not need automatic restart, run the following command: + + **./flume-manage.sh stop force** + + If you want to restart the Flume client, run the following command: + + **./flume-manage.sh start force** + +- Uninstall the Flume client of the Flume role. + + Assume that the Flume client installation path is **/opt/FlumeClient**. Run the following command to uninstall the Flume client: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/inst** + + **./uninstall.sh** diff --git a/doc/component-operation-guide/source/using_flume/using_environment_variables_in_the_properties.properties_file.rst b/doc/component-operation-guide/source/using_flume/using_environment_variables_in_the_properties.properties_file.rst new file mode 100644 index 0000000..5d26cbb --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/using_environment_variables_in_the_properties.properties_file.rst @@ -0,0 +1,71 @@ +:original_name: mrs_01_1058.html + +.. _mrs_01_1058: + +Using Environment Variables in the **properties.properties** File +================================================================= + +Scenario +-------- + +This section describes how to use environment variables in the **properties.properties** configuration file. + +This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +The Flume service is running properly and the Flume client has been installed. + +Procedure +--------- + +#. Log in to the node where the Flume client is installed as user **root**. + +#. Switch to the following directory: + + **cd** *Flume client installation directory*/**fusioninsight-flume**\ ``-``\ *Flume component version*/**conf** + +#. Add environment variables to the **flume-env.sh** file in the directory. + + - Format: + + .. code-block:: + + export Variable name=Variable value + + - Example: + + .. code-block:: + + JAVA_OPTS="-Xms2G -Xmx4G -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties" + export TAILDIR_PATH=/tmp/flumetest/201907/20190703/1/.*log.* + +#. Restart the Flume instance process. 
+ + a. Log in to FusionInsight Manager. + b. Choose **Cluster** > **Services** > **Flume**. On the page that is displayed, click the **Instance** tab, select all Flume instances, and choose **More** > **Restart Instance**. In the displayed **Verify Identity** dialog box, enter the password, and click **OK**. + + .. important:: + + Do not restart the Flume service on FusionInsight Manager after **flume-env.sh** takes effect on the server. Otherwise, the user-defined environment variables will be lost. You only need to restart the corresponding instances on FusionInsight Manager. + +#. .. _mrs_01_1058__li17459142018584: + + In the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/properties.properties** configuration file, reference variables in the **${**\ *Variable name*\ **}** format. The following is an example: + + .. code-block:: + + client.sources.s1.type = TAILDIR + client.sources.s1.filegroups = f1 + client.sources.s1.filegroups.f1 = ${TAILDIR_PATH} + client.sources.s1.positionFile = /tmp/flumetest/201907/20190703/1/taildir_position.json + client.sources.s1.channels = c1 + + .. important:: + + - Ensure that **flume-env.sh** takes effect before you go to :ref:`5 ` to configure the **properties.properties** file. + - If you configure the file on the local host, upload the file on FusionInsight Manager by performing the following steps. The user-defined environment variables may be lost if the operations are not performed in the correct sequence. + + a. Log in to FusionInsight Manager. + b. Choose **Cluster** > **Services** > **Flume**. On the page that is displayed, click the **Configurations** tab, select the Flume instance, and click **Upload File** next to **flume.config.file** to upload the **properties.properties** file. diff --git a/doc/component-operation-guide/source/using_flume/using_flume_from_scratch.rst b/doc/component-operation-guide/source/using_flume/using_flume_from_scratch.rst new file mode 100644 index 0000000..c9b55c3 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/using_flume_from_scratch.rst @@ -0,0 +1,310 @@ +:original_name: mrs_01_0397.html + +.. _mrs_01_0397: + +Using Flume from Scratch +======================== + +Scenario +-------- + +You can use Flume to import collected log information to Kafka. + +Prerequisites +------------- + +- A streaming cluster that contains components such as Flume and Kafka and has Kerberos authentication enabled has been created. +- The streaming cluster can properly communicate with the node where logs are generated. + +Using the Flume Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +.. note:: + + You do not need to perform :ref:`2 ` to :ref:`6 ` for a normal cluster. + +#. Install the Flume client. + + Install the Flume client in a directory, for example, **/opt/Flumeclient**, on the node where logs are generated by referring to :ref:`Installing the Flume Client on Clusters of Versions Earlier Than MRS 3.x `. The Flume client installation directories in the following steps are only examples. Change them to the actual installation directories. + +#. .. _mrs_01_0397__l78730912572649fd8edfda3920dc20cf: + + Copy the configuration file of the authentication server from the Master1 node to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory on the node where the Flume client is installed.
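+
+   For example, the file can be copied with **scp**. The following is a minimal sketch that assumes the Flume client is installed in **/opt/FlumeClient**, that user **root** is used on both nodes, and that the placeholders are replaced with the version-specific full path of **kdc.conf** listed in the following paragraphs:
+
+   .. code-block::
+
+      # Run on the Master1 node. Replace <kdc.conf full path> with the path for your version and
+      # <Flume client node IP> with the IP address of the node where the Flume client is installed.
+      scp <kdc.conf full path> root@<Flume client node IP>:/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/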
+ + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_KerberosClient/kdc.conf** is used as the full file path. + + For versions earlier than MRS 3.\ *x*, **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_KerberosClient/etc/kdc.conf** is used as the full file path. + + In the preceding paths, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Check the service IP address of any node where the Flume role is deployed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager. Choose **Cluster** > **Services** > **Flume** > **Instance**. Query **Service IP Address** of any node on which the Flume role is deployed. + - For MRS 1.9.2 to versions earlier than 3.x, click the cluster name on the MRS console and choose *Name of the desired cluster* > **Components** > **Flume** > **Instances** to view **Business IP Address** of any node where the Flume role is deployed. + +#. .. _mrs_01_0397__l762ab29694a642ac8ae1a0609cb97c9b: + + Copy the user authentication file from this node to the *Flume client installation directory*\ **/fusioninsight-flume-Flume component version number/conf** directory on the Flume client node. + + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab** is used as the full file path. + + For versions earlier than 3.\ *x*, **${BIGDATA_HOME}/MRS\_**\ *XXX*\ **/install/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab** is used as the full file path. + + In the preceding paths, **XXX** indicates the product version number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Copy the **jaas.conf** file from this node to the **conf** directory on the Flume client node. + + For versions earlier than MRS 1.9.2, **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Flume/jaas.conf** is used as the full file path. + + For versions earlier than MRS 3.\ *x*, **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Flume/etc/jaas.conf** is used as the full file path. + + In the preceding path, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. .. _mrs_01_0397__lfde322e0f3de4ccb88b4e195e65f9993: + + Log in to the Flume client node and go to the client installation directory. Run the following command to modify the file: + + **vi conf/jaas.conf** + + Change the full path of the user authentication file defined by **keyTab** to the **Flume client installation directory/fusioninsight-flume-*Flume component version number*/conf** saved in :ref:`4 `, and save the modification and exit. + +#. Run the following command to modify the **flume-env.sh** configuration file of the Flume client: + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/flume-env.sh** + + Add the following information after **-XX:+UseCMSCompactAtFullCollection**: + + .. 
code-block:: + + -Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.9.0/conf/kdc.conf -Djava.security.auth.login.config=Flume client installation directory/fusioninsight-flume-1.9.0/conf/jaas.conf -Dzookeeper.request.timeout=120000 + + Example: **"-XX:+UseCMSCompactAtFullCollection -Djava.security.krb5.conf=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/kdc.conf -Djava.security.auth.login.config=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/jaas.conf -Dzookeeper.request.timeout=120000"** + + Change *Flume client installation directory* to the actual installation directory. Then save and exit. + +#. Run the following command to restart the Flume client: + + **cd** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + + Example: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + +#. Run the following command to configure and save jobs in the Flume client configuration file **properties.properties** based on service requirements. + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/properties.properties** + + The following uses SpoolDir Source+File Channel+Kafka Sink as an example: + + .. code-block:: + + ######################################################################################### + client.sources = static_log_source + client.channels = static_log_channel + client.sinks = kafka_sink + ######################################################################################### + #LOG_TO_HDFS_ONLINE_1 + + client.sources.static_log_source.type = spooldir + client.sources.static_log_source.spoolDir = Monitoring directory + client.sources.static_log_source.fileSuffix = .COMPLETED + client.sources.static_log_source.ignorePattern = ^$ + client.sources.static_log_source.trackerDir = Metadata storage path during transmission + client.sources.static_log_source.maxBlobLength = 16384 + client.sources.static_log_source.batchSize = 51200 + client.sources.static_log_source.inputCharset = UTF-8 + client.sources.static_log_source.deserializer = LINE + client.sources.static_log_source.selector.type = replicating + client.sources.static_log_source.fileHeaderKey = file + client.sources.static_log_source.fileHeader = false + client.sources.static_log_source.basenameHeader = true + client.sources.static_log_source.basenameHeaderKey = basename + client.sources.static_log_source.deletePolicy = never + + client.channels.static_log_channel.type = file + client.channels.static_log_channel.dataDirs = Data cache path. Multiple paths, separated by commas (,), can be configured to improve performance. 
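+ # Hypothetical example values for dataDirs (above) and checkpointDir (below);
+ # the paths are placeholders only, so plan real directories for your own node:
+ # client.channels.static_log_channel.dataDirs = /srv/BigData/flumeclient/data1,/srv/BigData/flumeclient/data2
+ # client.channels.static_log_channel.checkpointDir = /srv/BigData/flumeclient/checkpoint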
+ client.channels.static_log_channel.checkpointDir = Checkpoint storage path + client.channels.static_log_channel.maxFileSize = 2146435071 + client.channels.static_log_channel.capacity = 1000000 + client.channels.static_log_channel.transactionCapacity = 612000 + client.channels.static_log_channel.minimumRequiredSpace = 524288000 + + client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink + client.sinks.kafka_sink.kafka.topic = Topic to which data is written, for example, flume_test + client.sinks.kafka_sink.kafka.bootstrap.servers = XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number + client.sinks.kafka_sink.flumeBatchSize = 1000 + client.sinks.kafka_sink.kafka.producer.type = sync + client.sinks.kafka_sink.kafka.security.protocol = SASL_PLAINTEXT + client.sinks.kafka_sink.kafka.kerberos.domain.name = Kafka domain name. This parameter is mandatory for a security cluster, for example, hadoop.xxx.com. + client.sinks.kafka_sink.requiredAcks = 0 + + client.sources.static_log_source.channels = static_log_channel + client.sinks.kafka_sink.channel = static_log_channel + + .. note:: + + - **client.sinks.kafka_sink.kafka.topic**: Topic to which data is written. If the topic does not exist in Kafka, it is automatically created by default. + + - **client.sinks.kafka_sink.kafka.bootstrap.servers**: List of Kafka Brokers, which are separated by commas (,). By default, the port is **21007** for a security cluster and **9092** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.security.protocol**: The value is **SASL_PLAINTEXT** for a security cluster and **PLAINTEXT** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.kerberos.domain.name**: + + You do not need to set this parameter for a normal cluster. For a security cluster, the value of this parameter is the value of **kerberos.domain.name** in the Kafka cluster. + + For versions earlier than MRS 1.9.2, obtain the value by checking **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Broker/server.properties** on the node where the broker instance resides. + + Obtain the value for versions earlier than MRS 3.\ *x* by checking **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Broker/etc/server.properties** on the node where the broker instance resides. + + In the preceding paths, **X** indicates a random number. Change it based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. After the parameters are set and saved, the Flume client automatically loads the content configured in **properties.properties**. When new log files are generated by spoolDir, the files are sent to Kafka producers and can be consumed by Kafka consumers. + +Using the Flume Client (MRS 3.x or Later) +----------------------------------------- + +.. note:: + + You do not need to perform :ref:`2 ` to :ref:`6 ` for a normal cluster. + +#. Install the Flume client. + + Install the Flume client in a directory, for example, **/opt/Flumeclient**, on the node where logs are generated by referring to :ref:`Installing the Flume Client on MRS 3.x or Later Clusters `. The Flume client installation directories in the following steps are only examples. Change them to the actual installation directories. + +#. .. 
_mrs_01_0397__li81278495417: + + Copy the configuration file of the authentication server from the Master1 node to the *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf** directory on the node where the Flume client is installed. + + The full file path is **${BIGDATA_HOME}/FusionInsight\_**\ **BASE\_**\ *XXX*\ **/1\_**\ *X*\ **\_KerberosClient/etc/kdc.conf**. In the preceding path, **XXX** indicates the product version number. **X** indicates a random number. Replace them based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Check the service IP address of any node where the Flume role is deployed. + + Log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster > Services > Flume > Instance**. Check the service IP address of any node where the Flume role is deployed. + +#. .. _mrs_01_0397__li4130849748: + + Copy the user authentication file from this node to the *Flume client installation directory*\ **/fusioninsight-flume-Flume component version number/conf** directory on the Flume client node. + + The full file path is **${BIGDATA_HOME}/FusionInsight_Porter\_**\ *XXX*\ **/install/FusionInsight-Flume-**\ *Flume component version number*\ **/flume/conf/flume.keytab**. + + In the preceding paths, **XXX** indicates the product version number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. Copy the **jaas.conf** file from this node to the **conf** directory on the Flume client node. + + The full file path is **${BIGDATA_HOME}/FusionInsight_Current/1\_**\ *X*\ **\_Flume/etc/jaas.conf**. + + In the preceding path, **X** indicates a random number. Change it based on the site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. .. _mrs_01_0397__li31329494415: + + Log in to the Flume client node and go to the client installation directory. Run the following command to modify the file: + + **vi conf/jaas.conf** + + Change the full path of the user authentication file defined by **keyTab** to the **Flume client installation directory/fusioninsight-flume-*Flume component version number*/conf** saved in :ref:`4 `, and save the modification and exit. + +#. Run the following command to modify the **flume-env.sh** configuration file of the Flume client: + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/flume-env.sh** + + Add the following information after **-XX:+UseCMSCompactAtFullCollection**: + + .. code-block:: + + -Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.9.0/conf/kdc.conf -Djava.security.auth.login.config=Flume client installation directory/fusioninsight-flume-1.9.0/conf/jaas.conf -Dzookeeper.request.timeout=120000 + + Example: **"-XX:+UseCMSCompactAtFullCollection -Djava.security.krb5.conf=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/kdc.conf -Djava.security.auth.login.config=/opt/FlumeClient**/**fusioninsight-flume-**\ *Flume component version number*\ **/conf/jaas.conf -Dzookeeper.request.timeout=120000"** + + Change *Flume client installation directory* to the actual installation directory. Then save and exit. + +#. 
Run the following command to restart the Flume client: + + **cd** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + + Example: + + **cd /opt/FlumeClient/fusioninsight-flume-**\ *Flume component version number*\ **/bin** + + **./flume-manage.sh restart** + +#. Configure jobs based on actual service scenarios. + + - Some parameters, for MRS 3.\ *x* or later, can be configured on Manager. + + - Set the parameters in the **properties.properties** file. The following uses SpoolDir Source+File Channel+Kafka Sink as an example. + + Run the following command on the node where the Flume client is installed. Configure and save jobs in the Flume client configuration file **properties.properties** based on actual service requirements. + + **vi** *Flume client installation directory*\ **/fusioninsight-flume-**\ *Flume component version number*\ **/conf/properties.properties** + + .. code-block:: + + ######################################################################################### + client.sources = static_log_source + client.channels = static_log_channel + client.sinks = kafka_sink + ######################################################################################### + #LOG_TO_HDFS_ONLINE_1 + + client.sources.static_log_source.type = spooldir + client.sources.static_log_source.spoolDir = Monitoring directory + client.sources.static_log_source.fileSuffix = .COMPLETED + client.sources.static_log_source.ignorePattern = ^$ + client.sources.static_log_source.trackerDir = Metadata storage path during transmission + client.sources.static_log_source.maxBlobLength = 16384 + client.sources.static_log_source.batchSize = 51200 + client.sources.static_log_source.inputCharset = UTF-8 + client.sources.static_log_source.deserializer = LINE + client.sources.static_log_source.selector.type = replicating + client.sources.static_log_source.fileHeaderKey = file + client.sources.static_log_source.fileHeader = false + client.sources.static_log_source.basenameHeader = true + client.sources.static_log_source.basenameHeaderKey = basename + client.sources.static_log_source.deletePolicy = never + + client.channels.static_log_channel.type = file + client.channels.static_log_channel.dataDirs = Data cache path. Multiple paths, separated by commas (,), can be configured to improve performance. + client.channels.static_log_channel.checkpointDir = Checkpoint storage path + client.channels.static_log_channel.maxFileSize = 2146435071 + client.channels.static_log_channel.capacity = 1000000 + client.channels.static_log_channel.transactionCapacity = 612000 + client.channels.static_log_channel.minimumRequiredSpace = 524288000 + + client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink + client.sinks.kafka_sink.kafka.topic = Topic to which data is written, for example, flume_test + client.sinks.kafka_sink.kafka.bootstrap.servers = XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number,XXX.XXX.XXX.XXX:Kafka port number + client.sinks.kafka_sink.flumeBatchSize = 1000 + client.sinks.kafka_sink.kafka.producer.type = sync + client.sinks.kafka_sink.kafka.security.protocol = SASL_PLAINTEXT + client.sinks.kafka_sink.kafka.kerberos.domain.name = Kafka domain name. This parameter is mandatory for a security cluster, for example, hadoop.xxx.com. + client.sinks.kafka_sink.requiredAcks = 0 + + client.sources.static_log_source.channels = static_log_channel + client.sinks.kafka_sink.channel = static_log_channel + + .. 
note:: + + - **client.sinks.kafka_sink.kafka.topic**: Topic to which data is written. If the topic does not exist in Kafka, it is automatically created by default. + + - **client.sinks.kafka_sink.kafka.bootstrap.servers**: List of Kafka Brokers, which are separated by commas (,). By default, the port is **21007** for a security cluster and **9092** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.security.protocol**: The value is **SASL_PLAINTEXT** for a security cluster and **PLAINTEXT** for a normal cluster. + + - **client.sinks.kafka_sink.kafka.kerberos.domain.name**: + + You do not need to set this parameter for a normal cluster. For a security cluster, the value of this parameter is the value of **kerberos.domain.name** in the Kafka cluster. + + For versions earlier than MRS 1.9.2, obtain the value by checking **${BIGDATA_HOME}/FusionInsight/etc/1\_**\ *X*\ **\_Broker/server.properties** on the node where the broker instance resides. + + Obtain the value for versions earlier than MRS 3.\ *x* by checking **${BIGDATA_HOME}/MRS_Current/1\_**\ *X*\ **\_Broker/etc/server.properties** on the node where the broker instance resides. + + In the preceding paths, **X** indicates a random number. Change it based on site requirements. The file must be saved by the user who installs the Flume client, for example, user **root**. + +#. After the parameters are set and saved, the Flume client automatically loads the content configured in **properties.properties**. When new log files are generated by spoolDir, the files are sent to Kafka producers and can be consumed by Kafka consumers. diff --git a/doc/component-operation-guide/source/using_flume/using_the_encryption_tool_of_the_flume_client.rst b/doc/component-operation-guide/source/using_flume/using_the_encryption_tool_of_the_flume_client.rst new file mode 100644 index 0000000..ee010e8 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/using_the_encryption_tool_of_the_flume_client.rst @@ -0,0 +1,43 @@ +:original_name: mrs_01_0395.html + +.. _mrs_01_0395: + +Using the Encryption Tool of the Flume Client +============================================= + +Scenario +-------- + +You can use the encryption tool provided by the Flume client to encrypt some parameter values in the configuration file. + +Prerequisites +------------- + +The Flume client has been installed. + +Procedure +--------- + +#. Log in to the Flume client node and go to the client installation directory, for example, **/opt/FlumeClient**. + +#. Run the following command to switch the directory: + + **cd fusioninsight-flume-**\ *Flume component version number*\ **/bin** + +#. Run the following command to encrypt information: + + **./genPwFile.sh** + + Input the information that you want to encrypt twice. + +#. Run the following command to query the encrypted information: + + **cat password.property** + + .. note:: + + If the encryption parameter is used for the Flume server, you need to perform encryption on the corresponding Flume server node. You need to run the encryption script as user **omm** for encryption. + + - For versions earlier than MRS 1.9.2, the encryption path is **${BIGDATA_HOME}/FusionInsight/FusionInsight-Flume-*Flume component version number*/flume/bin/genPwFile.sh**. + - For versions earlier than MRS 3.x, the encryption path is **/opt/Bigdata/MRS_XXX/install/FusionInsight-Flume-*Flume component version number*/flume/bin/genPwFile.sh**. 
+ - For MRS 3.\ *x* or later, the encryption path is **/opt/Bigdata/FusionInsight_Porter_XXX/install/FusionInsight-Flume-*Flume component version number*/flume/bin/genPwFile.sh**. *XXX* indicates the product version number. diff --git a/doc/component-operation-guide/source/using_flume/viewing_flume_client_logs.rst b/doc/component-operation-guide/source/using_flume/viewing_flume_client_logs.rst new file mode 100644 index 0000000..21b9e58 --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/viewing_flume_client_logs.rst @@ -0,0 +1,55 @@ +:original_name: mrs_01_0393.html + +.. _mrs_01_0393: + +Viewing Flume Client Logs +========================= + +Scenario +-------- + +You can view logs to locate faults. + +Prerequisites +------------- + +The Flume client has been installed. + +Procedure +--------- + +#. Go to the Flume client log directory (**/var/log/Bigdata** by default). + +#. Run the following command to view the log file: + + **ls -lR flume-client-\*** + + A log file is shown as follows: + + .. code-block:: + + flume-client-1/flume: + total 7672 + -rw-------. 1 root root 0 Sep 8 19:43 Flume-audit.log + -rw-------. 1 root root 1562037 Sep 11 06:05 FlumeClient.2017-09-11_04-05-09.[1].log.zip + -rw-------. 1 root root 6127274 Sep 11 14:47 FlumeClient.log + -rw-------. 1 root root 2935 Sep 8 22:20 flume-root-20170908202009-pid72456-gc.log.0.current + -rw-------. 1 root root 2935 Sep 8 22:27 flume-root-20170908202634-pid78789-gc.log.0.current + -rw-------. 1 root root 4382 Sep 8 22:47 flume-root-20170908203137-pid84925-gc.log.0.current + -rw-------. 1 root root 4390 Sep 8 23:46 flume-root-20170908204918-pid103920-gc.log.0.current + -rw-------. 1 root root 3196 Sep 9 10:12 flume-root-20170908215351-pid44372-gc.log.0.current + -rw-------. 1 root root 2935 Sep 9 10:13 flume-root-20170909101233-pid55119-gc.log.0.current + -rw-------. 1 root root 6441 Sep 9 11:10 flume-root-20170909101631-pid59301-gc.log.0.current + -rw-------. 1 root root 0 Sep 9 11:10 flume-root-20170909111009-pid119477-gc.log.0.current + -rw-------. 1 root root 92896 Sep 11 13:24 flume-root-20170909111126-pid120689-gc.log.0.current + -rw-------. 1 root root 5588 Sep 11 14:46 flume-root-20170911132445-pid42259-gc.log.0.current + -rw-------. 1 root root 2576 Sep 11 13:24 prestartDetail.log + -rw-------. 1 root root 3303 Sep 11 13:24 startDetail.log + -rw-------. 1 root root 1253 Sep 11 13:24 stopDetail.log + + flume-client-1/monitor: + total 8 + -rw-------. 1 root root 141 Sep 8 19:43 flumeMonitorChecker.log + -rw-------. 1 root root 2946 Sep 11 13:24 flumeMonitor.log + + In the log file, **FlumeClient.log** is the run log of the Flume client. diff --git a/doc/component-operation-guide/source/using_flume/viewing_flume_client_monitoring_information.rst b/doc/component-operation-guide/source/using_flume/viewing_flume_client_monitoring_information.rst new file mode 100644 index 0000000..932be1a --- /dev/null +++ b/doc/component-operation-guide/source/using_flume/viewing_flume_client_monitoring_information.rst @@ -0,0 +1,21 @@ +:original_name: mrs_01_1596.html + +.. _mrs_01_1596: + +Viewing Flume Client Monitoring Information +=========================================== + +Scenario +-------- + +The Flume client outside the FusionInsight cluster is a part of the end-to-end data collection. Both the Flume client outside the cluster and the Flume server in the cluster need to be monitored. 
Users can use FusionInsight Manager to monitor the Flume client and view the monitoring indicators of the Source, Sink, and Channel of the client as well as the client process status. + +This section applies to MRS 3.\ *x* or later clusters. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Flume** > **Flume Management** to view the current Flume client list and process status. +#. Click the **Instance ID**, and view client monitoring metrics in the **Current** area. +#. Click **History**. The page for querying historical monitoring data is displayed. Select a time range and click **View** to view the monitoring data within the time range. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_deal_with_the_restrictions_of_the_phoenix_bulkload_tool.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_deal_with_the_restrictions_of_the_phoenix_bulkload_tool.rst new file mode 100644 index 0000000..26b20f4 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_deal_with_the_restrictions_of_the_phoenix_bulkload_tool.rst @@ -0,0 +1,72 @@ +:original_name: mrs_01_2211.html + +.. _mrs_01_2211: + +How Do I Deal with the Restrictions of the Phoenix BulkLoad Tool? +================================================================= + +Question +-------- + +When the indexed field data is updated, if a batch of data exists in the user table, the BulkLoad tool cannot update the global and partial mutable indexes. + +Answer +------ + +**Problem Analysis** + +#. Create a table. + + .. code-block:: + + CREATE TABLE TEST_TABLE( + DATE varchar not null, + NUM integer not null, + SEQ_NUM integer not null, + ACCOUNT1 varchar not null, + ACCOUNTDES varchar, + FLAG varchar, + SALL double, + CONSTRAINT PK PRIMARY KEY (DATE,NUM,SEQ_NUM,ACCOUNT1) + ); + +#. Create a global index. + + **CREATE INDEX TEST_TABLE_INDEX ON TEST_TABLE(ACCOUNT1,DATE,NUM,ACCOUNTDES,SEQ_NUM)**; + +#. Insert data. + + **UPSERT INTO TEST_TABLE (DATE,NUM,SEQ_NUM,ACCOUNT1,ACCOUNTDES,FLAG,SALL) values ('20201001',30201001,13,'367392332','sffa1','','');** + +#. Execute the BulkLoad task to update data. + + **hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -t TEST_TABLE -i /tmp/test.csv**, where the content of **test.csv** is as follows: + + ======== ======== == ========= ======= ======= == + 20201001 30201001 13 367392332 sffa888 1231243 23 + ======== ======== == ========= ======= ======= == + +#. Symptom: The existing index data cannot be directly updated. As a result, two pieces of index data exist. + + .. code-block:: + + +------------+-----------+-----------+---------------+----------------+ + | :ACCOUNT1 | :DATE | :NUM | 0:ACCOUNTDES | :SEQ_NUM | + +------------+-----------+-----------+---------------+----------------+ + | 367392332 | 20201001 | 30201001 | sffa1 | 13 | + | 367392332 | 20201001 | 30201001 | sffa888 | 13 | + +------------+-----------+-----------+---------------+----------------+ + +**Solution** + +#. Delete the old index table. + + **DROP INDEX TEST_TABLE_INDEX ON TEST_TABLE;** + +#. Create an index table in asynchronous mode. + + **CREATE INDEX TEST_TABLE_INDEX ON TEST_TABLE(ACCOUNT1,DATE,NUM,ACCOUNTDES,SEQ_NUM) ASYNC;** + +#. Recreate a index. 
+ + **hbase org.apache.phoenix.mapreduce.index.IndexTool --data-table TEST_TABLE --index-table TEST_TABLE_INDEX --output-path /user/test_table** diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_delete_residual_table_names_in_the__hbase_table-lock_directory_of_zookeeper.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_delete_residual_table_names_in_the__hbase_table-lock_directory_of_zookeeper.rst new file mode 100644 index 0000000..eaf4801 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_delete_residual_table_names_in_the__hbase_table-lock_directory_of_zookeeper.rst @@ -0,0 +1,23 @@ +:original_name: mrs_01_1652.html + +.. _mrs_01_1652: + +How Do I Delete Residual Table Names in the /hbase/table-lock Directory of ZooKeeper? +===================================================================================== + +Question +-------- + +In security mode, names of tables that failed to be created are unnecessarily retained in the table-lock node (default directory is /hbase/table-lock) of ZooKeeper. How do I delete these residual table names? + +Answer +------ + +Perform the following steps: + +#. On the client, run the kinit command as the hbase user to obtain a security certificate. +#. Run the **hbase zkcli** command to launch the ZooKeeper Command Line Interface (zkCLI). +#. Run the **ls /hbase/table** command on the zkCLI to check whether the table name of the table that fails to be created exists. + + - If the table name exists, no further operation is required. + - If the table name does not exist, run **ls /hbase/table-lock** to check whether the table name of the table fail to be created exist. If the table name exists, run the **delete /hbase/table-lock/** command to delete the table name. In the **delete /hbase/table-lock/
<table_name>** command, **<table_name>
** indicates the residual table name. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_fix_region_overlapping.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_fix_region_overlapping.rst new file mode 100644 index 0000000..a383cac --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_fix_region_overlapping.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_1660.html + +.. _mrs_01_1660: + +How Do I Fix Region Overlapping? +================================ + +Question +-------- + +When the HBaseFsck tool is used to check the region status in MRS 3.x and later versions, if the log contains **ERROR: (regions region1 and region2) There is an overlap in the region chain** or **ERROR: (region region1) Multiple regions have the same startkey: xxx**, overlapping exists in some regions. How do I solve this problem? + +Answer +------ + +To rectify the fault, perform the following steps: + +#. .. _mrs_01_1660__l57959cf11dc74b388d62a55b172f9fa6: + + Run the **hbase hbck -repair** *tableName* command to restore the table that contains overlapping. + +#. Run the **hbase hbck** *tableName* command to check whether overlapping exists in the restored table. + + - If overlapping does not exist, go to :ref:`3 `. + - If overlapping exists, go to :ref:`1 `. + +#. .. _mrs_01_1660__lc78ee31171b54bc988743bab2a08bbc9: + + Log in to FusionInsight Manager and choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **More** > **Perform HMaster Switchover** to complete the HMaster active/standby switchover. + +#. Run the **hbase hbck** *tableName* command to check whether overlapping exists in the restored table. + + - If overlapping does not exist, no further action is required. + - If overlapping still exists, start from :ref:`1 ` to perform the recovery again. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_restore_a_region_in_the_rit_state_for_a_long_time.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_restore_a_region_in_the_rit_state_for_a_long_time.rst new file mode 100644 index 0000000..1a68506 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/how_do_i_restore_a_region_in_the_rit_state_for_a_long_time.rst @@ -0,0 +1,22 @@ +:original_name: mrs_01_1644.html + +.. _mrs_01_1644: + +How Do I Restore a Region in the RIT State for a Long Time? +=========================================================== + +Question +-------- + +How do I restore a region in the RIT state for a long time? + +Answer +------ + +Log in to the HMaster Web UI, choose **Procedure & Locks** in the navigation tree, and check whether any process ID is in the **Waiting** state. If yes, run the following command to release the procedure lock: + +**hbase hbck -j** *Client installation directory*\ **/HBase/hbase/tools/hbase-hbck2-*.jar bypass -o** *pid* + +Check whether the state is in the **Bypass** state. If the procedure on the UI is always in **RUNNABLE(Bypass)** state, perform an active/standby switchover. Run the **assigns** command to bring the region online again. 
+ +**hbase hbck -j** *Client installation directory*\ **/HBase/hbase/tools/hbase-hbck2-*.jar assigns -o** *regionName* diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/index.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/index.rst new file mode 100644 index 0000000..e50b3cc --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/index.rst @@ -0,0 +1,62 @@ +:original_name: mrs_01_1638.html + +.. _mrs_01_1638: + +Common Issues About HBase +========================= + +- :ref:`Why Does a Client Keep Failing to Connect to a Server for a Long Time? ` +- :ref:`Operation Failures Occur in Stopping BulkLoad On the Client ` +- :ref:`Why May a Table Creation Exception Occur When HBase Deletes or Creates the Same Table Consecutively? ` +- :ref:`Why Other Services Become Unstable If HBase Sets up A Large Number of Connections over the Network Port? ` +- :ref:`Why Does the HBase BulkLoad Task (One Table Has 26 TB Data) Consisting of 210,000 Map Tasks and 10,000 Reduce Tasks Fail? ` +- :ref:`How Do I Restore a Region in the RIT State for a Long Time? ` +- :ref:`Why Does HMaster Exits Due to Timeout When Waiting for the Namespace Table to Go Online? ` +- :ref:`Why Does SocketTimeoutException Occur When a Client Queries HBase? ` +- :ref:`Why Modified and Deleted Data Can Still Be Queried by Using the Scan Command? ` +- :ref:`Why "java.lang.UnsatisfiedLinkError: Permission denied" exception thrown while starting HBase shell? ` +- :ref:`When does the RegionServers listed under "Dead Region Servers" on HMaster WebUI gets cleared? ` +- :ref:`Why Are Different Query Results Returned After I Use Same Query Criteria to Query Data Successfully Imported by HBase bulkload? ` +- :ref:`What Should I Do If I Fail to Create Tables Due to the FAILED_OPEN State of Regions? ` +- :ref:`How Do I Delete Residual Table Names in the /hbase/table-lock Directory of ZooKeeper? ` +- :ref:`Why Does HBase Become Faulty When I Set a Quota for the Directory Used by HBase in HDFS? ` +- :ref:`Why HMaster Times Out While Waiting for Namespace Table to be Assigned After Rebuilding Meta Using OfflineMetaRepair Tool and Startups Failed ` +- :ref:`Why Messages Containing FileNotFoundException and no lease Are Frequently Displayed in the HMaster Logs During the WAL Splitting Process? ` +- :ref:`Insufficient Rights When a Tenant Accesses Phoenix ` +- :ref:`What Can I Do When HBase Fails to Recover a Task and a Message Is Displayed Stating "Rollback recovery failed"? ` +- :ref:`How Do I Fix Region Overlapping? ` +- :ref:`Why Does RegionServer Fail to Be Started When GC Parameters Xms and Xmx of HBase RegionServer Are Set to 31 GB? ` +- :ref:`Why Does the LoadIncrementalHFiles Tool Fail to Be Executed and "Permission denied" Is Displayed When Nodes in a Cluster Are Used to Import Data in Batches? ` +- :ref:`Why Is the Error Message "import argparse" Displayed When the Phoenix sqlline Script Is Used? ` +- :ref:`How Do I Deal with the Restrictions of the Phoenix BulkLoad Tool? ` +- :ref:`Why a Message Is Displayed Indicating that the Permission is Insufficient When CTBase Connects to the Ranger Plug-ins? ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + why_does_a_client_keep_failing_to_connect_to_a_server_for_a_long_time + operation_failures_occur_in_stopping_bulkload_on_the_client + why_may_a_table_creation_exception_occur_when_hbase_deletes_or_creates_the_same_table_consecutively + why_other_services_become_unstable_if_hbase_sets_up_a_large_number_of_connections_over_the_network_port + why_does_the_hbase_bulkload_task_one_table_has_26_tb_data_consisting_of_210,000_map_tasks_and_10,000_reduce_tasks_fail + how_do_i_restore_a_region_in_the_rit_state_for_a_long_time + why_does_hmaster_exits_due_to_timeout_when_waiting_for_the_namespace_table_to_go_online + why_does_sockettimeoutexception_occur_when_a_client_queries_hbase + why_modified_and_deleted_data_can_still_be_queried_by_using_the_scan_command + why_java.lang.unsatisfiedlinkerror_permission_denied_exception_thrown_while_starting_hbase_shell + when_does_the_regionservers_listed_under_dead_region_servers_on_hmaster_webui_gets_cleared + why_are_different_query_results_returned_after_i_use_same_query_criteria_to_query_data_successfully_imported_by_hbase_bulkload + what_should_i_do_if_i_fail_to_create_tables_due_to_the_failed_open_state_of_regions + how_do_i_delete_residual_table_names_in_the__hbase_table-lock_directory_of_zookeeper + why_does_hbase_become_faulty_when_i_set_a_quota_for_the_directory_used_by_hbase_in_hdfs + why_hmaster_times_out_while_waiting_for_namespace_table_to_be_assigned_after_rebuilding_meta_using_offlinemetarepair_tool_and_startups_failed + why_messages_containing_filenotfoundexception_and_no_lease_are_frequently_displayed_in_the_hmaster_logs_during_the_wal_splitting_process + insufficient_rights_when_a_tenant_accesses_phoenix + what_can_i_do_when_hbase_fails_to_recover_a_task_and_a_message_is_displayed_stating_rollback_recovery_failed + how_do_i_fix_region_overlapping + why_does_regionserver_fail_to_be_started_when_gc_parameters_xms_and_xmx_of_hbase_regionserver_are_set_to_31_gb + why_does_the_loadincrementalhfiles_tool_fail_to_be_executed_and_permission_denied_is_displayed_when_nodes_in_a_cluster_are_used_to_import_data_in_batches + why_is_the_error_message_import_argparse_displayed_when_the_phoenix_sqlline_script_is_used + how_do_i_deal_with_the_restrictions_of_the_phoenix_bulkload_tool + why_a_message_is_displayed_indicating_that_the_permission_is_insufficient_when_ctbase_connects_to_the_ranger_plug-ins diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/insufficient_rights_when_a_tenant_accesses_phoenix.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/insufficient_rights_when_a_tenant_accesses_phoenix.rst new file mode 100644 index 0000000..131fb04 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/insufficient_rights_when_a_tenant_accesses_phoenix.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_1657.html + +.. _mrs_01_1657: + +Insufficient Rights When a Tenant Accesses Phoenix +================================================== + +Question +-------- + +When a tenant accesses Phoenix, a message is displayed indicating that the tenant has insufficient rights. + +Answer +------ + +You need to associate the HBase service and Yarn queues when creating a tenant. + +The tenant must be granted additional rights to perform operations on Phoenix, that is, the RWX permission on the Phoenix system table. + +Example: + +Tenant **hbase** has been created. 
Log in to the HBase Shell as user **admin** and run the **scan 'hbase:acl'** command to query the role of the tenant. The role is **hbase_1450761169920** (in the format of tenant name_timestamp). + +Run the following commands to grant rights to the tenant (if the Phoenix system table has not been generated, log in to the Phoenix client as user **admin** first and then grant rights in the HBase Shell): + +**grant '@hbase_1450761169920','RWX','SYSTEM.CATALOG'** + +**grant '@hbase_1450761169920','RWX','SYSTEM.FUNCTION'** + +**grant '@hbase_1450761169920','RWX','SYSTEM.SEQUENCE'** + +**grant '@hbase_1450761169920','RWX','SYSTEM.STATS'** + +Create user **phoenix** and bind it to tenant **hbase**, so that tenant **hbase** can access the Phoenix client as user **phoenix**. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/operation_failures_occur_in_stopping_bulkload_on_the_client.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/operation_failures_occur_in_stopping_bulkload_on_the_client.rst new file mode 100644 index 0000000..2ed1e64 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/operation_failures_occur_in_stopping_bulkload_on_the_client.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_1640.html + +.. _mrs_01_1640: + +Operation Failures Occur in Stopping BulkLoad On the Client +=========================================================== + +Question +-------- + +Why do submitted operations fail when BulkLoad is stopped on the client during BulkLoad data import? + +Answer +------ + +When BulkLoad is enabled on the client, a partitioner file is generated and used to demarcate the input data ranges of Map tasks. The file is automatically deleted when BulkLoad exits on the client. In general, if all Map tasks have been started and are running, stopping BulkLoad on the client does not cause submitted operations to fail. However, due to the retry and speculative execution mechanisms of Map tasks, a Map task is run again if the Reduce task fails to download the data of that completed Map task more times than the allowed limit. In this case, if BulkLoad has already exited on the client, the retried Map task fails because the partitioner file is missing, and the submitted operation fails as a result. Therefore, do not stop BulkLoad on the client during data import. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_can_i_do_when_hbase_fails_to_recover_a_task_and_a_message_is_displayed_stating_rollback_recovery_failed.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_can_i_do_when_hbase_fails_to_recover_a_task_and_a_message_is_displayed_stating_rollback_recovery_failed.rst new file mode 100644 index 0000000..29507e3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_can_i_do_when_hbase_fails_to_recover_a_task_and_a_message_is_displayed_stating_rollback_recovery_failed.rst @@ -0,0 +1,42 @@ +:original_name: mrs_01_1659.html + +.. _mrs_01_1659: + +What Can I Do When HBase Fails to Recover a Task and a Message Is Displayed Stating "Rollback recovery failed"? +=============================================================================================================== + +Question +-------- + +The system automatically rolls back data after an HBase recovery task fails. If "Rollback recovery failed" is displayed, the rollback fails.
After the rollback fails, data stops being processed and junk data may be generated. How can I resolve this problem? + +Answer +------ + +You need to manually clear the junk data before performing the backup or recovery task next time. + +#. Install the cluster client in **/opt/client**. + +#. Run **source /opt/client/bigdata_env** as the client installation user to configure environment variables. + +#. Run **kinit admin**. + +#. Run **zkCli.sh -server** *business IP address of ZooKeeper*\ **:2181** to connect to ZooKeeper. + +#. Run **deleteall /recovering** to delete the junk data. Run **quit** to disconnect from ZooKeeper. + + .. note:: + + Running this command will cause data loss. Exercise caution. + +#. Run **hdfs dfs -rm -f -r /user/hbase/backup** to delete temporary data. + +#. Log in to FusionInsight Manager and choose **O&M**. In the navigation pane on the left, choose **Backup and Restoration** > **Restoration Management**. In the task list, locate the row that contains the target task and click **View History** in the **Operation** column. In the displayed dialog box, click |image1| before a specified execution record to view the snapshot name. + + .. code-block:: + + Snapshot [ snapshot name ] is created successfully before recovery. + +#. Switch to the client, run **hbase shell**, and then run **delete_all_snapshot** '*snapshot name*\ **.*'** to delete the temporary snapshot. + +.. |image1| image:: /_static/images/en-us_image_0000001349170329.png diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_should_i_do_if_i_fail_to_create_tables_due_to_the_failed_open_state_of_regions.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_should_i_do_if_i_fail_to_create_tables_due_to_the_failed_open_state_of_regions.rst new file mode 100644 index 0000000..6707654 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/what_should_i_do_if_i_fail_to_create_tables_due_to_the_failed_open_state_of_regions.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_1651.html + +.. _mrs_01_1651: + +What Should I Do If I Fail to Create Tables Due to the FAILED_OPEN State of Regions? +==================================================================================== + +Question +-------- + +What should I do if I fail to create tables due to the FAILED_OPEN state of Regions? + +Answer +------ + +If a network, HDFS, or Active HMaster fault occurs during the creation of tables, some Regions may fail to go online and therefore enter the FAILED_OPEN state. In this case, tables fail to be created. + +Tables that fail to be created due to the preceding issue cannot be repaired. To solve this problem, perform the following operations to delete and re-create the tables: + +#. Run the following command on the cluster client to repair the state of the tables: + + **hbase hbck -fixTableStates** + +#. Enter the HBase shell and run the following commands to delete the tables that fail to be created, where *<table_name>* indicates the name of such a table: + + **truncate** *'<table_name>'* + + **disable** *'<table_name>'* + + **drop** *'<table_name>'* + +#. Create the tables again using the original table creation commands (see the example following this procedure).
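The following is a minimal sketch of this repair sequence, assuming a hypothetical table named **t1** with a single column family **cf1**; replace these names with those of the table that actually failed to be created: + + .. code-block:: + + # On the cluster client, repair the table states recorded by HMaster. + hbase hbck -fixTableStates + # In the HBase shell, remove the half-created table. + hbase shell + truncate 't1' + disable 't1' + drop 't1' + # Afterwards, re-create the table with its original definition, for example: + create 't1', 'cf1'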
diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/when_does_the_regionservers_listed_under_dead_region_servers_on_hmaster_webui_gets_cleared.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/when_does_the_regionservers_listed_under_dead_region_servers_on_hmaster_webui_gets_cleared.rst new file mode 100644 index 0000000..9f5b650 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/when_does_the_regionservers_listed_under_dead_region_servers_on_hmaster_webui_gets_cleared.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1649.html + +.. _mrs_01_1649: + +When does the RegionServers listed under "Dead Region Servers" on HMaster WebUI gets cleared? +============================================================================================= + +Question +-------- + +When do the RegionServers listed under "Dead Region Servers" on the HMaster WebUI get cleared? + +Answer +------ + +When an online RegionServer goes down abruptly, it is displayed under "Dead Region Servers" on the HMaster WebUI. When the dead RegionServer restarts and reports back to HMaster successfully, it is removed from the "Dead Region Servers" list on the HMaster WebUI. + +The "Dead Region Servers" list is also cleared when an HMaster failover operation is performed successfully. + +If an Active HMaster hosting some regions is abruptly killed, the standby HMaster becomes the new Active HMaster and displays the previous Active HMaster as a dead RegionServer. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_a_message_is_displayed_indicating_that_the_permission_is_insufficient_when_ctbase_connects_to_the_ranger_plug-ins.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_a_message_is_displayed_indicating_that_the_permission_is_insufficient_when_ctbase_connects_to_the_ranger_plug-ins.rst new file mode 100644 index 0000000..0920711 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_a_message_is_displayed_indicating_that_the_permission_is_insufficient_when_ctbase_connects_to_the_ranger_plug-ins.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_2212.html + +.. _mrs_01_2212: + +Why a Message Is Displayed Indicating that the Permission is Insufficient When CTBase Connects to the Ranger Plug-ins? +====================================================================================================================== + +Question +-------- + +When CTBase accesses an HBase service with the Ranger plug-ins enabled and creates a cluster table, a message is displayed indicating that the permission is insufficient. + +.. code-block:: + + ERROR: Create ClusterTable failed.
Error: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions for user 'ctbase2@HADOOP.COM' (action=create) + at org.apache.ranger.authorization.hbase.AuthorizationSession.publishResults(AuthorizationSession.java:278) + at org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor.authorizeAccess(RangerAuthorizationCoprocessor.java:654) + at org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor.requirePermission(RangerAuthorizationCoprocessor.java:772) + at org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor.preCreateTable(RangerAuthorizationCoprocessor.java:943) + at org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor.preCreateTable(RangerAuthorizationCoprocessor.java:428) + at org.apache.hadoop.hbase.master.MasterCoprocessorHost$12.call(MasterCoprocessorHost.java:351) + at org.apache.hadoop.hbase.master.MasterCoprocessorHost$12.call(MasterCoprocessorHost.java:348) + at org.apache.hadoop.hbase.coprocessor.CoprocessorHost$ObserverOperationWithoutResult.callObserver(CoprocessorHost.java:581) + at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.execOperation(CoprocessorHost.java:655) + at org.apache.hadoop.hbase.master.MasterCoprocessorHost.preCreateTable(MasterCoprocessorHost.java:348) + at org.apache.hadoop.hbase.master.HMaster$5.run(HMaster.java:2192) + at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134) + at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:2189) + at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:711) + at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) + at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:458) + at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) + at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) + at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318) + +Answer +------ + +CTBase users can configure permission policies on the Ranger page and grant the READ, WRITE, CREATE, ADMIN, and EXECUTE permissions to the CTBase metadata table **\_ctmeta\_**, cluster table, and index table. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_are_different_query_results_returned_after_i_use_same_query_criteria_to_query_data_successfully_imported_by_hbase_bulkload.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_are_different_query_results_returned_after_i_use_same_query_criteria_to_query_data_successfully_imported_by_hbase_bulkload.rst new file mode 100644 index 0000000..c95af15 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_are_different_query_results_returned_after_i_use_same_query_criteria_to_query_data_successfully_imported_by_hbase_bulkload.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1650.html + +.. _mrs_01_1650: + +Why Are Different Query Results Returned After I Use Same Query Criteria to Query Data Successfully Imported by HBase bulkload? +=============================================================================================================================== + +Question +-------- + +If the data to be imported by HBase bulkload has identical rowkeys, the data import is successful but identical query criteria produce different query results. 
+ +Answer +------ + +Data with an identical rowkey is loaded into HBase in the order in which data is read. The data with the latest timestamp is considered to be the latest data. By default, data is not queried by timestamp. Therefore, if you query for data with an identical rowkey, only the latest data is returned. + +While data is being loaded by bulkload, the memory processes the data into HFiles quickly, leading to the possibility that data with an identical rowkey has a same timestamp. In this case, identical query criteria may produce different query results. + +To avoid this problem, ensure that the same data file does not contain identical rowkeys while you are creating tables or loading data. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_a_client_keep_failing_to_connect_to_a_server_for_a_long_time.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_a_client_keep_failing_to_connect_to_a_server_for_a_long_time.rst new file mode 100644 index 0000000..8a030c3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_a_client_keep_failing_to_connect_to_a_server_for_a_long_time.rst @@ -0,0 +1,47 @@ +:original_name: mrs_01_1639.html + +.. _mrs_01_1639: + +Why Does a Client Keep Failing to Connect to a Server for a Long Time? +====================================================================== + +Question +-------- + +A HBase server is faulty and cannot provide services. In this case, when a table operation is performed on the HBase client, why is the operation suspended and no response is received for a long time? + +Answer +------ + +**Problem Analysis** + +When the HBase server malfunctions, the table operation request from the HBase client is tried for several times and times out. The default timeout value is **Integer.MAX_VALUE (2147483647 ms)**. The table operation request is retired constantly during such a long period of time and is suspended at last. + +**Solution** + +The HBase client provides two configuration items to configure the retry and timeout of the client. :ref:`Table 1 ` describes them. + +Set the following parameters in the **Client installation path/HBase/hbase/conf/hbase-site.xml** configuration file: + +.. _mrs_01_1639__te9ce661d0c4a4745b801616b66b97321: + +.. table:: **Table 1** Configuration parameters of retry and timeout + + +--------------------------------+-----------------------------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +================================+=====================================================================================================+===============+ + | hbase.client.operation.timeout | Client operation timeout period You need to manually add the information to the configuration file. | 2147483647 ms | + +--------------------------------+-----------------------------------------------------------------------------------------------------+---------------+ + | hbase.client.retries.number | Maximum retry times supported by all retryable operations. | 35 | + +--------------------------------+-----------------------------------------------------------------------------------------------------+---------------+ + +:ref:`Figure 1 ` describes the working principles of retry and timeout. + +.. _mrs_01_1639__fc7b1b6a1826d4b98bb68d1fd842512cb: + +.. 
figure:: /_static/images/en-us_image_0000001296090588.jpg + :alt: **Figure 1** Process for HBase client operation retry timeout + + **Figure 1** Process for HBase client operation retry timeout + +The process indicates that a suspension occurs if the preceding parameters are not configured based on site requirements. It is recommended that a proper timeout period be set based on scenarios. If the operation takes a long time, set a long timeout period. If the operation takes a shot time, set a short timeout period. The number of retries can be set to **(hbase.client.retries.number)*60*1000(ms)**. The timeout period can be slightly greater than **hbase.client.operation.timeout**. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hbase_become_faulty_when_i_set_a_quota_for_the_directory_used_by_hbase_in_hdfs.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hbase_become_faulty_when_i_set_a_quota_for_the_directory_used_by_hbase_in_hdfs.rst new file mode 100644 index 0000000..d91e3f0 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hbase_become_faulty_when_i_set_a_quota_for_the_directory_used_by_hbase_in_hdfs.rst @@ -0,0 +1,57 @@ +:original_name: mrs_01_1653.html + +.. _mrs_01_1653: + +Why Does HBase Become Faulty When I Set a Quota for the Directory Used by HBase in HDFS? +======================================================================================== + +Question +-------- + +Why does HBase become faulty when I set quota for the directory used by HBase in HDFS? + +Answer +------ + +The flush operation of a table is to write memstore data to HDFS. + +If the HDFS directory does not have sufficient disk space quota, the flush operation will fail and the region server will stop. + +.. code-block:: + + Caused by: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The DiskSpace quota of /hbase/data// is exceeded: quota = 1024 B = 1 KB but diskspace consumed = 402655638 B = 384.00 MB + ?at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyStoragespaceQuota(DirectoryWithQuotaFeature.java:211) + ?at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:239) + ?at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyQuota(FSDirectory.java:882) + ?at org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:711) + ?at org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:670) + ?at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addBlock(FSDirectory.java:495) + +In the preceding exception, the disk space quota of the **/hbase/data//** table is 1 KB, but the memstore data is 384.00 MB. Therefore, the flush operation fails and the region server stops. + +When the region server is terminated, HMaster replays the WAL file of the terminated region server to restore data. The disk space quota is limited. As a result, the replay operation of the WAL file fails, and the HMaster process exits unexpectedly. + +.. 
code-block:: + + 2016-07-28 19:11:40,352 | FATAL | MASTER_SERVER_OPERATIONS-10-91-9-131:16000-0 | Caught throwable while processing event M_SERVER_SHUTDOWN | org.apache.hadoop.hbase.master.HMaster.abort(HMaster.java:2474) + java.io.IOException: failed log splitting for 10-91-9-131,16020,1469689987884, will retry + ?at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.resubmit(ServerShutdownHandler.java:365) + ?at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:220) + ?at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) + ?at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) + ?at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) + ?at java.lang.Thread.run(Thread.java:745) + Caused by: java.io.IOException: error or interrupted while splitting logs in [hdfs://hacluster/hbase/WALs/,,-splitting] Task = installed = 6 done = 3 error = 3 + ?at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:290) + ?at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:402) + ?at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:375) + +Therefore, you cannot set the quota value for the HBase directory in HDFS. If the exception occurs, perform the following operations: + +#. Run the **kinit** *Username* command on the client to enable the HBase user to obtain security authentication. + +#. Run the **hdfs dfs -count -q** */hbase/data//* command to check the allocated disk space quota. + +#. Run the following command to cancel the quota limit and restore HBase: + + **hdfs dfsadmin -clrSpaceQuota** */hbase/data//* diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hmaster_exits_due_to_timeout_when_waiting_for_the_namespace_table_to_go_online.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hmaster_exits_due_to_timeout_when_waiting_for_the_namespace_table_to_go_online.rst new file mode 100644 index 0000000..4695b78 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_hmaster_exits_due_to_timeout_when_waiting_for_the_namespace_table_to_go_online.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_1645.html + +.. _mrs_01_1645: + +Why Does HMaster Exits Due to Timeout When Waiting for the Namespace Table to Go Online? +======================================================================================== + +Question +-------- + +Why does HMaster exit due to timeout when waiting for the namespace table to go online? + +Answer +------ + +During the HMaster active/standby switchover or startup, HMaster performs WAL splitting and region recovery for the RegionServer that failed or was stopped previously. + +Multiple threads are running in the background to monitor the HMaster startup process. + +- TableNamespaceManager + + This is a help class, which is used to manage the allocation of namespace tables and monitoring table regions during HMaster active/standby switchover or startup. If the namespace table is not online within the specified time (**hbase.master.namespace.init.timeout**, which is 3,600,000 ms by default), the thread terminates HMaster abnormally. + +- InitializationMonitor + + This is an initialization thread monitoring class of the primary HMaster, which is used to monitor the initialization of the primary HMaster. 
If a thread fails to be initialized within the specified time (**hbase.master.initializationmonitor.timeout**, which is 3,600,000 ms by default), the thread terminates HMaster abnormally. If **hbase.master.initializationmonitor.haltontimeout** is started, the default value is **false**. + +During the HMaster active/standby switchover or startup, if the **WAL hlog** file exists, the WAL splitting task is initialized. If the WAL hlog splitting task is complete, it initializes the table region allocation task. + +HMaster uses ZooKeeper to coordinate log splitting tasks and valid RegionServers and track task development. If the primary HMaster exits during the log splitting task, the new primary HMaster attempts to resend the unfinished task, and RegionServer starts the log splitting task from the beginning. + +The initialization of the HMaster is delayed due to the following reasons: + +- Network faults occur intermittently. +- Disks run into bottlenecks. +- The log splitting task is overloaded, and RegionServer runs slowly. +- RegionServer (region opening) responds slowly. + +In the preceding scenarios, you are advised to add the following configuration parameters to enable HMaster to complete the restoration task earlier. Otherwise, the Master will exit, causing a longer delay of the entire restoration process. + +- Increase the online waiting timeout period of the namespace table to ensure that the Master has enough time to coordinate the splitting tasks of the RegionServer worker and avoid repeated tasks. + + **hbase.master.namespace.init.timeout** (default value: 3,600,000 ms) + +- Increase the number of concurrent splitting tasks through RegionServer worker to ensure that RegionServer worker can process splitting tasks in parallel (RegionServers need more cores). Add the following parameters to *Client installation path* **/HBase/hbase/conf/hbase-site.xml**: + + **hbase.regionserver.wal.max.splitters** (default value: 2) + +- If all restoration processes require time, increase the timeout period for initializing the monitoring thread. + + **hbase.master.initializationmonitor.timeout** (default value: 3,600,000 ms) diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_regionserver_fail_to_be_started_when_gc_parameters_xms_and_xmx_of_hbase_regionserver_are_set_to_31_gb.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_regionserver_fail_to_be_started_when_gc_parameters_xms_and_xmx_of_hbase_regionserver_are_set_to_31_gb.rst new file mode 100644 index 0000000..d277f5b --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_regionserver_fail_to_be_started_when_gc_parameters_xms_and_xmx_of_hbase_regionserver_are_set_to_31_gb.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1661.html + +.. _mrs_01_1661: + +Why Does RegionServer Fail to Be Started When GC Parameters Xms and Xmx of HBase RegionServer Are Set to 31 GB? +=============================================================================================================== + +Question +-------- + +(MRS 3.x and later versions) Check the **hbase-omm-*.out** log of the node where RegionServer fails to be started. It is found that the log contains **An error report file with more information is saved as: /tmp/hs_err_pid*.log**. Check the **/tmp/hs_err_pid*.log** file. 
It is found that the log contains **#Internal Error (vtableStubs_aarch64.cpp:213), pid=9456, tid=0x0000ffff97fdd200 and #guarantee(_\_ pc() <= s->code_end()) failed: overflowed buffer**, indicating that the problem is caused by JDK. How do I solve this problem? + +Answer +------ + +To rectify the fault, perform the following steps: + +#. Run the **su - omm** command on a node where RegionServer fails to be started to switch to user **omm**. + +#. Run the **java -XX:+PrintFlagsFinal -version \|grep HeapBase** command as user **omm**. Information similar to the following is displayed: + + .. code-block:: + + uintx HeapBaseMinAddress = 2147483648 {pd product} + +#. Change the values of **-Xms** and **-Xmx** in **GC_OPTS** to values that are not between **32G-HeapBaseMinAddress** and **32G**, excluding the values of **32G** and **32G-HeapBaseMinAddress**. + +#. Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Instance**, select the failed instance, and choose **More** > **Restart Instance** to restart the failed instance. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_sockettimeoutexception_occur_when_a_client_queries_hbase.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_sockettimeoutexception_occur_when_a_client_queries_hbase.rst new file mode 100644 index 0000000..9b2e5db --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_sockettimeoutexception_occur_when_a_client_queries_hbase.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_1646.html + +.. _mrs_01_1646: + +Why Does SocketTimeoutException Occur When a Client Queries HBase? +================================================================== + +Question +-------- + +Why does the following exception occur on the client when I use the HBase client to operate table data? + +.. code-block:: + + 2015-12-15 02:41:14,054 | WARN | [task-result-getter-2] | Lost task 2.0 in stage 58.0 (TID 3288, linux-175): + org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions: + Tue Dec 15 02:41:14 CST 2015, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60303: + row 'xxxxxx' on table 'xxxxxx' at region=xxxxxx,\x05\x1E\x80\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x000\x00\x80\x00\x00\x00\x80\x00\x00\x00\x80\x00\x00, + 1449912620868.6a6b7d0c272803d8186930a3bfdb10a9., hostname=xxxxxx,16020,1449941841479, seqNum=5 + at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:275) + at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:223) + at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61) + at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) + at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:323) + +At the same time, the following log is displayed on RegionServer: + +.. 
code-block:: + + 2015-12-15 02:45:44,551 | WARN | PriorityRpcServer.handler=7,queue=1,port=16020 | (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest) + ","starttimems":1450118730780,"responsesize":416,"method":"Scan","processingtimems":13770,"client":"10.91.8.175:41182","queuetimems":0,"class":"HRegionServer"} | + org.apache.hadoop.hbase.ipc.RpcServer.logResponse(RpcServer.java:2221) + 2015-12-15 02:45:57,722 | WARN | PriorityRpcServer.handler=3,queue=1,port=16020 | (responseTooSlow): + {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1450118746297,"responsesize":416, + "method":"Scan","processingtimems":11425,"client":"10.91.8.175:41182","queuetimems":1746,"class":"HRegionServer"} | org.apache.hadoop.hbase.ipc.RpcServer.logResponse(RpcServer.java:2221) + 2015-12-15 02:47:21,668 | INFO | LruBlockCacheStatsExecutor | totalSize=7.54 GB, freeSize=369.52 MB, max=7.90 GB, blockCount=406107, + accesses=35400006, hits=16803205, hitRatio=47.47%, , cachingAccesses=31864266, cachingHits=14806045, cachingHitsRatio=46.47%, + evictions=17654, evicted=16642283, evictedPerRun=942.69189453125 | org.apache.hadoop.hbase.io.hfile.LruBlockCache.logStats(LruBlockCache.java:858) + 2015-12-15 02:52:21,668 | INFO | LruBlockCacheStatsExecutor | totalSize=7.51 GB, freeSize=395.34 MB, max=7.90 GB, blockCount=403080, + accesses=35685793, hits=16933684, hitRatio=47.45%, , cachingAccesses=32150053, cachingHits=14936524, cachingHitsRatio=46.46%, + evictions=17684, evicted=16800617, evictedPerRun=950.046142578125 | org.apache.hadoop.hbase.io.hfile.LruBlockCache.logStats(LruBlockCache.java:858) + +Answer +------ + +The memory allocated to RegionServer is too small and the number of Regions is too large. As a result, the memory is insufficient during the running, and the server responds slowly to the client. Modify the following memory allocation parameters in the **hbase-site.xml** configuration file of RegionServer: + +.. table:: **Table 1** RegionServer memory allocation parameters + + +------------------------+-----------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +========================+=====================================================================================================+=========================================================================================================================+ + | GC_OPTS | Initial memory and maximum memory allocated to RegionServer in startup parameters. | -Xms8G -Xmx8G | + +------------------------+-----------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+ + | hfile.block.cache.size | Percentage of the maximum heap (-Xmx setting) allocated to the block cache of HFiles or StoreFiles. | When **offheap** is disabled, the default value is **0.25**. When **offheap** is enabled, the default value is **0.1**. 
| + +------------------------+-----------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_hbase_bulkload_task_one_table_has_26_tb_data_consisting_of_210,000_map_tasks_and_10,000_reduce_tasks_fail.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_hbase_bulkload_task_one_table_has_26_tb_data_consisting_of_210,000_map_tasks_and_10,000_reduce_tasks_fail.rst new file mode 100644 index 0000000..727f509 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_hbase_bulkload_task_one_table_has_26_tb_data_consisting_of_210,000_map_tasks_and_10,000_reduce_tasks_fail.rst @@ -0,0 +1,25 @@ +:original_name: mrs_01_1643.html + +.. _mrs_01_1643: + +Why Does the HBase BulkLoad Task (One Table Has 26 TB Data) Consisting of 210,000 Map Tasks and 10,000 Reduce Tasks Fail? +========================================================================================================================= + +Question +-------- + +The HBase bulkLoad task (a single table contains 26 TB data) has 210,000 maps and 10,000 reduce tasks (in MRS 3.x or later), and the task fails. + +Answer +------ + +**ZooKeeper I/O bottleneck observation methods:** + +#. On the monitoring page of Manager, check whether the number of ZooKeeper requests on a single node exceeds the upper limit. +#. View ZooKeeper and HBase logs to check whether a large number of I/O Exception Timeout or SocketTimeout Exception exceptions occur. + +**Optimization suggestions:** + +#. Change the number of ZooKeeper instances to 5 or more. You are advised to set **peerType** to **observer** to increase the number of observers. +#. Control the number of concurrent maps of a single task or reduce the memory for running tasks on each node to lighten the node load. +#. Upgrade ZooKeeper data disks, such as SSDs. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_loadincrementalhfiles_tool_fail_to_be_executed_and_permission_denied_is_displayed_when_nodes_in_a_cluster_are_used_to_import_data_in_batches.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_loadincrementalhfiles_tool_fail_to_be_executed_and_permission_denied_is_displayed_when_nodes_in_a_cluster_are_used_to_import_data_in_batches.rst new file mode 100644 index 0000000..5be5252 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_does_the_loadincrementalhfiles_tool_fail_to_be_executed_and_permission_denied_is_displayed_when_nodes_in_a_cluster_are_used_to_import_data_in_batches.rst @@ -0,0 +1,72 @@ +:original_name: mrs_01_0625.html + +.. _mrs_01_0625: + +Why Does the LoadIncrementalHFiles Tool Fail to Be Executed and "Permission denied" Is Displayed When Nodes in a Cluster Are Used to Import Data in Batches? +============================================================================================================================================================ + +Question +-------- + +Why does the LoadIncrementalHFiles tool fail to be executed and "Permission denied" is displayed when a Linux user is manually created in a normal cluster and DataNode in the cluster is used to import data in batches? + +.. 
code-block:: + + 2020-09-20 14:53:53,808 WARN [main] shortcircuit.DomainSocketFactory: error creating DomainSocket + java.net.ConnectException: connect(2) error: Permission denied when trying to connect to '/var/run/FusionInsight-HDFS/dn_socket' + at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method) + at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:256) + at org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.createSocket(DomainSocketFactory.java:168) + at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:804) + at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:526) + at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:785) + at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:722) + at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:483) + at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:360) + at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:663) + at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:594) + at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) + at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:845) + at java.io.DataInputStream.readFully(DataInputStream.java:195) + at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:401) + at org.apache.hadoop.hbase.io.hfile.HFile.isHFileFormat(HFile.java:651) + at org.apache.hadoop.hbase.io.hfile.HFile.isHFileFormat(HFile.java:634) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.visitBulkHFiles(LoadIncrementalHFiles.java:1090) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.discoverLoadQueue(LoadIncrementalHFiles.java:1006) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.prepareHFileQueue(LoadIncrementalHFiles.java:257) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:364) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1263) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1276) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1311) + at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) + at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.main(LoadIncrementalHFiles.java:1333) + +Answer +------ + +If the client that the LoadIncrementalHFiles tool depends on is installed in the cluster and is on the same node as DataNode, HDFS creates short-circuit read during the execution of the tool to improve performance. The short-circuit read depends on the **/var/run/FusionInsight-HDFS** directory (**dfs.domain.socket.path**). The default permission on this directory is **750**. This user does not have the permission to operate the directory. + +To solve the preceding problem, perform the following operations: + +Method 1: Create a user (recommended). + +#. Create a user on Manager. By default, the user group contains the **ficommon** group. + + .. code-block:: console + + [root@xxx-xxx-xxx-xxx ~]# id test + uid=20038(test) gid=9998(ficommon) groups=9998(ficommon) + +#. Import data again. + +Method 2: Change the owner group of the current user. + +#. Add the user to the **ficommon** group. + + .. 
code-block:: console
+
+      [root@xxx-xxx-xxx-xxx ~]# usermod -a -G ficommon test
+      [root@xxx-xxx-xxx-xxx ~]# id test
+      uid=2102(test) gid=2102(test) groups=2102(test),9998(ficommon)
+
+#. Import data again.
diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_hmaster_times_out_while_waiting_for_namespace_table_to_be_assigned_after_rebuilding_meta_using_offlinemetarepair_tool_and_startups_failed.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_hmaster_times_out_while_waiting_for_namespace_table_to_be_assigned_after_rebuilding_meta_using_offlinemetarepair_tool_and_startups_failed.rst
new file mode 100644
index 0000000..6905c85
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_hmaster_times_out_while_waiting_for_namespace_table_to_be_assigned_after_rebuilding_meta_using_offlinemetarepair_tool_and_startups_failed.rst
@@ -0,0 +1,35 @@
+:original_name: mrs_01_1654.html
+
+.. _mrs_01_1654:
+
+Why HMaster Times Out While Waiting for Namespace Table to be Assigned After Rebuilding Meta Using OfflineMetaRepair Tool and Startups Failed
+==============================================================================================================================================================
+
+Question
+--------
+
+Why does HMaster time out while waiting for the namespace table to be assigned after meta is rebuilt using the OfflineMetaRepair tool, and why does the startup fail?
+
+HMaster aborts with the following FATAL message:
+
+.. code-block::
+
+   2017-06-15 15:11:07,582 FATAL [Hostname:16000.activeMasterManager] master.HMaster: Unhandled exception. Starting shutdown.
+   java.io.IOException: Timedout 120000ms waiting for namespace table to be assigned
+   at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
+   at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1054)
+   at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:848)
+   at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
+   at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
+   at java.lang.Thread.run(Thread.java:745)
+
+Answer
+------
+
+When meta is rebuilt by the OfflineMetaRepair tool, HMaster waits during startup for the WALs of all RegionServers to be split to avoid data inconsistency. HMaster triggers user region assignment once WAL splitting completes. In unusual cluster scenarios, WAL splitting may take a long time, depending on multiple factors such as a large number of WALs, slow I/O, and unstable RegionServers.
+
+HMaster must be able to finish splitting the WALs of all RegionServers successfully. Perform the following steps:
+
+#. Make sure that the cluster is stable and that no other problems exist. If any problem occurs, correct it first.
+#. Set the **hbase.master.initializationmonitor.timeout** parameter to a larger value. The default value is **3600000** milliseconds. A quick check sketch is provided after this list.
+#. Restart the HBase service.
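+The following is a minimal check sketch. The client path **/opt/client** and the **7200000** ms value are examples only; on a cluster managed by Manager, change the value on the HBase configuration page rather than by editing the file directly.
+
+.. code-block:: console
+
+   # Check the value currently present in a delivered hbase-site.xml (path is an example client installation directory).
+   grep -A 1 'hbase.master.initializationmonitor.timeout' /opt/client/HBase/hbase/conf/hbase-site.xml
+   # Example output after increasing the value:
+   # <name>hbase.master.initializationmonitor.timeout</name>
+   # <value>7200000</value>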
diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_is_the_error_message_import_argparse_displayed_when_the_phoenix_sqlline_script_is_used.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_is_the_error_message_import_argparse_displayed_when_the_phoenix_sqlline_script_is_used.rst
new file mode 100644
index 0000000..ea512e0
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_is_the_error_message_import_argparse_displayed_when_the_phoenix_sqlline_script_is_used.rst
@@ -0,0 +1,17 @@
+:original_name: mrs_01_2210.html
+
+.. _mrs_01_2210:
+
+Why Is the Error Message "import argparse" Displayed When the Phoenix sqlline Script Is Used?
+===============================================================================================
+
+Question
+--------
+
+When the sqlline script is used on the client, the error message "import argparse" is displayed.
+
+Answer
+------
+
+#. Log in to the node where the HBase client is installed as user **root**. Perform security authentication using the **hbase** user.
+#. Go to the directory where the sqlline script of the HBase client is stored and run the **python3 sqlline.py** command.
diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_java.lang.unsatisfiedlinkerror_permission_denied_exception_thrown_while_starting_hbase_shell.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_java.lang.unsatisfiedlinkerror_permission_denied_exception_thrown_while_starting_hbase_shell.rst
new file mode 100644
index 0000000..5f3c304
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_java.lang.unsatisfiedlinkerror_permission_denied_exception_thrown_while_starting_hbase_shell.rst
@@ -0,0 +1,18 @@
+:original_name: mrs_01_1648.html
+
+.. _mrs_01_1648:
+
+Why Is a "java.lang.UnsatisfiedLinkError: Permission denied" Exception Thrown While Starting HBase Shell?
+==============================================================================================================
+
+Question
+--------
+
+Why is a "java.lang.UnsatisfiedLinkError: Permission denied" exception thrown while starting HBase shell?
+
+Answer
+------
+
+During HBase shell execution, JRuby creates temporary files under the **java.io.tmpdir** path, whose default value is **/tmp**. If the NOEXEC permission is set on the **/tmp** directory, the HBase shell fails to start and throws a "java.lang.UnsatisfiedLinkError: Permission denied" exception.
+
+Therefore, **java.io.tmpdir** must be set to a different path in **HBASE_OPTS**/**CLIENT_GC_OPTS** if NOEXEC is set on the **/tmp** directory.
diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_may_a_table_creation_exception_occur_when_hbase_deletes_or_creates_the_same_table_consecutively.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_may_a_table_creation_exception_occur_when_hbase_deletes_or_creates_the_same_table_consecutively.rst
new file mode 100644
index 0000000..d286ba3
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_may_a_table_creation_exception_occur_when_hbase_deletes_or_creates_the_same_table_consecutively.rst
@@ -0,0 +1,28 @@
+:original_name: mrs_01_1641.html
+
+.. _mrs_01_1641:
+
+Why May a Table Creation Exception Occur When HBase Deletes or Creates the Same Table Consecutively?
+==================================================================================================== + +Question +-------- + +When HBase consecutively deletes and creates the same table, why may a table creation exception occur? + +Answer +------ + +Execution process: Disable Table > Drop Table > Create Table > Disable Table > Drop Table > And more + +#. When a table is disabled, HMaster sends an RPC request to RegionServer, and RegionServer brings the region offline. When the time required for closing a region on RegionServer exceeds the timeout period for HBase HMaster to wait for the region to enter the RIT state, HMaster considers that the region is offline by default. Actually, the region may be in the flush memstore phase. +#. After an RPC request is sent to close a region, HMaster checks whether all regions in the table are offline. If the closure times out, HMaster considers that the regions are offline and returns a message indicating that the regions are successfully closed. +#. After the closure is successful, the data directory corresponding to the HBase table is deleted. +#. After the table is deleted, the data directory is recreated by the region that is still in the flush memstore phase. +#. When the table is created again, the **temp** directory is copied to the HBase data directory. However, the HBase data directory is not empty. As a result, when the HDFS rename API is called, the data directory changes to the last layer of the **temp** directory and is appended to the HBase data directory, for example, **$rootDir/data/$nameSpace/$tableName/$tableName**. In this case, the table fails to be created. + +**Troubleshooting Method** + +When this problem occurs, check whether the HBase data directory corresponding to the table exists. If it exists, rename the directory. + +The HBase data directory consists of **$rootDir/data/$nameSpace/$tableName**, for example, **hdfs://hacluster/hbase/data/default/TestTable**. **$rootDir** is the HBase root directory, which can be obtained by configuring **hbase.rootdir.perms** in **hbase-site.xml**. The **data** directory is a fixed directory of HBase. **$nameSpace** indicates the nameSpace name. **$tableName** indicates the table name. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_messages_containing_filenotfoundexception_and_no_lease_are_frequently_displayed_in_the_hmaster_logs_during_the_wal_splitting_process.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_messages_containing_filenotfoundexception_and_no_lease_are_frequently_displayed_in_the_hmaster_logs_during_the_wal_splitting_process.rst new file mode 100644 index 0000000..620033c --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_messages_containing_filenotfoundexception_and_no_lease_are_frequently_displayed_in_the_hmaster_logs_during_the_wal_splitting_process.rst @@ -0,0 +1,59 @@ +:original_name: mrs_01_1655.html + +.. _mrs_01_1655: + +Why Messages Containing FileNotFoundException and no lease Are Frequently Displayed in the HMaster Logs During the WAL Splitting Process? +========================================================================================================================================= + +Question +-------- + +Why messages containing FileNotFoundException and no lease are frequently displayed in the HMaster logs during the WAL splitting process? + +.. 
code-block:: + + 2017-06-10 09:50:27,586 | ERROR | split-log-closeStream-2 | Couldn't close log at hdfs://hacluster/hbase/data/default/largeT1/2b48346d087275fe751fc049334fda93/recovered.edits/0000000000000000000.temp | org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1330) + java.io.FileNotFoundException: No lease on /hbase/data/default/largeT1/2b48346d087275fe751fc049334fda93/recovered.edits/0000000000000000000.temp (inode 1092653): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_1202985678_1, pendingcreates: 1936] + ?at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3432) + ?at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3223) + ?at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3057) + ?at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3011) + ?at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:842) + ?at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:526) + ?at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) + ?at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) + ?at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) + ?at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2260) + ?at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2256) + ?at java.security.AccessController.doPrivileged(Native Method) + ?at javax.security.auth.Subject.doAs(Subject.java:422) + ?at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1769) + ?at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2254) + + ?at sun.reflect.GeneratedConstructorAccessor40.newInstance(Unknown Source) + ?at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) + ?at java.lang.reflect.Constructor.newInstance(Constructor.java:423) + ?at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) + ?at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) + ?at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1842) + ?at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1639) + ?at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:665) + +Answer +------ + +During the WAL splitting process, the WAL splitting timeout period is specified by the **hbase.splitlog.manager.timeout** parameter. If the WAL splitting process fails to complete within the timeout period, the task is submitted again. Multiple WAL splitting tasks may be submitted during a specified period. If the **temp** file is deleted when one WAL splitting task completes, other tasks cannot find the file and the FileNotFoudException exception is reported. To avoid the problem, perform the following modifications: + +The default value of **hbase.splitlog.manager.timeout** is 600,000 ms. The cluster specification is that each RegionServer has 2,000 to 3,000 regions. When the cluster is normal (HBase is normal and HDFS does not have a large number of read and write operations), you are advised to adjust this parameter based on the cluster specifications. 
If the actual specifications (the actual average number of regions on each RegionServer) are greater than the default specifications (the default average number of regions on each RegionServer, that is, 2,000), the adjustment solution is (actual specifications/default specifications) x Default time. + +Set the **splitlog** parameter in the **hbase-site.xml** file on the server. :ref:`Table 1 ` describes the parameter. + +.. _mrs_01_1655__td061a2527dd94860b0b6d9989d7fd9ee: + +.. table:: **Table 1** Description of the **splitlog** parameter + + +--------------------------------+----------------------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +================================+==============================================================================================+===============+ + | hbase.splitlog.manager.timeout | Timeout period for receiving worker response by the distributed SplitLog management program. | 600000 | + +--------------------------------+----------------------------------------------------------------------------------------------+---------------+ diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_modified_and_deleted_data_can_still_be_queried_by_using_the_scan_command.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_modified_and_deleted_data_can_still_be_queried_by_using_the_scan_command.rst new file mode 100644 index 0000000..8d62aef --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_modified_and_deleted_data_can_still_be_queried_by_using_the_scan_command.rst @@ -0,0 +1,43 @@ +:original_name: mrs_01_1647.html + +.. _mrs_01_1647: + +Why Modified and Deleted Data Can Still Be Queried by Using the Scan Command? +============================================================================= + +Question +-------- + +Why modified and deleted data can still be queried by using the **scan** command? + +.. code-block:: + + scan '',{FILTER=>"SingleColumnValueFilter('','column',=,'binary:')"} + +Answer +------ + +Because of the scalability of HBase, all values specific to the versions in the queried column are all matched by default, even if the values have been modified or deleted. For a row where column matching has failed (that is, the column does not exist in the row), the HBase also queries the row. + +If you want to query only the new values and rows where column matching is successful, you can use the following statement: + +.. code-block:: + + scan '',{FILTER=>"SingleColumnValueFilter('','column',=,'binary:',true,true)"} + +This command can filter all rows where column query has failed. It queries only the latest values of the current data in the table; that is, it does not query the values before modification or the deleted values. + +.. note:: + + The related parameters of **SingleColumnValueFilter** are described as follows: + + SingleColumnValueFilter(final byte[] family, final byte[] qualifier, final CompareOp compareOp, ByteArrayComparable comparator, final boolean filterIfMissing, final boolean latestVersionOnly) + + Parameter description: + + - family: family of the column to be queried. + - qualifier: column to be queried. + - compareOp: comparison operation, such as = and >. + - comparator: target value to be queried. + - filterIfMissing: whether a row is filtered out if the queried column does not exist. The default value is false. 
+ - latestVersionOnly: whether values of the latest version are queried. The default value is false. diff --git a/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_other_services_become_unstable_if_hbase_sets_up_a_large_number_of_connections_over_the_network_port.rst b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_other_services_become_unstable_if_hbase_sets_up_a_large_number_of_connections_over_the_network_port.rst new file mode 100644 index 0000000..3c6347e --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/common_issues_about_hbase/why_other_services_become_unstable_if_hbase_sets_up_a_large_number_of_connections_over_the_network_port.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_1642.html + +.. _mrs_01_1642: + +Why Other Services Become Unstable If HBase Sets up A Large Number of Connections over the Network Port? +======================================================================================================== + +Question +-------- + +Why other services become unstable if HBase sets up a large number of connections over the network port? + +Answer +------ + +When the OS command **lsof** or **netstat** is run, it is found that many TCP connections are in the CLOSE_WAIT state and the owner of the connections is HBase RegionServer. This can cause exhaustion of network ports or limit exceeding of HDFS connections, resulting in instability of other services. The HBase CLOSE_WAIT phenomenon is the HBase mechanism. + +The reason why HBase CLOSE_WAIT occurs is as follows: HBase data is stored in the HDFS as HFile, which can be called StoreFiles. HBase functions as the client of the HDFS. When HBase creates a StoreFile or starts loading a StoreFile, it creates an HDFS connection. When the StoreFile is created or loaded successfully, the HDFS considers that the task is completed and transfers the connection close permission to HBase. However, HBase may choose not to close the connection to ensure real-time response; that is, HBase may maintain the connection so that it can quickly access the corresponding data file upon request. In this case, the connection is in the CLOSE_WAIT, which indicates that the connection needs to be closed by the client. + +When a StoreFile will be created: HBase executes the Flush operation. + +When Flush is executed: The data written by HBase is first stored in memstore. The Flush operation is performed only when the usage of memstore reaches the threshold or the **flush** command is run to write data into the HDFS. + +To resolve the issue, use either of the following methods: + +Because of the HBase connection mechanism, the number of StoreFiles must be restricted to reduce the occupation of HBase ports. This can be achieved by triggering HBase's the compaction action, that is, HBase file merging. + +Method 1: On HBase shell client, run **major_compact**. + +Method 2: Compile HBase client code to invoke the compact method of the HBaseAdmin class to trigger HBase's compaction action. + +If the HBase port occupation issue cannot be resolved through compact, it indicates that the HBase usage has reached the bottleneck. In such a case, you are advised to perform the following: + +- Check whether the initial number of Regions configured in the table is appropriate. +- Check whether useless data exists. + +If useless data exists, delete the data to reduce the number of storage files for the HBase. If the preceding conditions are not met, then you need to consider a capacity expansion. 
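+A minimal sketch of the check and the manual merge described above is as follows; the RegionServer process ID and the table name **t1** are placeholders:
+
+.. code-block:: console
+
+   # Count CLOSE_WAIT connections held by a RegionServer (replace <pid> with the RegionServer process ID).
+   lsof -p <pid> | grep -c CLOSE_WAIT
+
+   # Trigger a major compaction for one table to merge its StoreFiles (Method 1).
+   hbase shell
+   hbase(main):001:0> major_compact 't1'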
diff --git a/doc/component-operation-guide/source/using_hbase/community_bulkload_tool.rst b/doc/component-operation-guide/source/using_hbase/community_bulkload_tool.rst new file mode 100644 index 0000000..7fd7a3c --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/community_bulkload_tool.rst @@ -0,0 +1,8 @@ +:original_name: mrs_01_1612.html + +.. _mrs_01_1612: + +Community BulkLoad Tool +======================= + +The Apache HBase official website provides the function of importing data in batches. For details, see the description of the **Import** and **ImportTsv** tools at http://hbase.apache.org/2.2/book.html#tools. diff --git a/doc/component-operation-guide/source/using_hbase/configuring_hbase_data_compression_and_encoding.rst b/doc/component-operation-guide/source/using_hbase/configuring_hbase_data_compression_and_encoding.rst new file mode 100644 index 0000000..b4d902f --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_hbase_data_compression_and_encoding.rst @@ -0,0 +1,113 @@ +:original_name: mrs_01_24112.html + +.. _mrs_01_24112: + +Configuring HBase Data Compression and Encoding +=============================================== + +Scenario +-------- + +HBase encodes data blocks in HFiles to reduce duplicate keys in KeyValues, reducing used space. Currently, the following data block encoding modes are supported: NONE, PREFIX, DIFF, FAST_DIFF, and ROW_INDEX_V1. NONE indicates that data blocks are not encoded. HBase also supports compression algorithms for HFile compression. The following algorithms are supported by default: NONE, GZ, SNAPPY, and ZSTD. NONE indicates that HFiles are not compressed. + +The two methods are used on the HBase column family. They can be used together or separately. + +Prerequisites +------------- + +- You have installed an HBase client. For example, the client is installed in **opt/client**. +- If authentication has been enabled for HBase, you must have the corresponding operation permissions. For example, you must have the creation (C) or administration (A) permission on the corresponding namespace or higher-level items to create a table, and the creation (C) or administration (A) permission on the created table or higher-level items to modify a table. For details about how to grant permissions, see :ref:`Creating HBase Roles `. + +Procedure +--------- + +**Setting data block encoding and compression algorithms during creation** + +- **Method 1: Using hbase shell** + + #. Log in to the node where the client is installed as the client installation user. + + #. Run the following command to go to the client directory: + + **cd /opt/client** + + #. Run the following command to configure environment variables: + + **source bigdata_env** + + #. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step: + + **kinit** *Component service user* + + For example, **kinit hbaseuser**. + + #. Run the following HBase client command: + + **hbase shell** + + #. Create a table. + + **create '**\ *t1*\ **', {NAME => '**\ *f1*\ **', COMPRESSION => '**\ *SNAPPY*\ **', DATA_BLOCK_ENCODING => '**\ *FAST_DIFF*\ **'}** + + .. note:: + + - *t1*: indicates the table name. + - *f1*: indicates the column family name. + - *SNAPPY*: indicates the column family uses the SNAPPY compression algorithm. + - *FAST_DIFF*: indicates FAST_DIFF is used for encoding. 
+      - The parameter in the braces specifies the column family. You can specify multiple column families using multiple braces and separate them by commas (,). For details about table creation statements, run the **help 'create'** statement in the HBase shell.
+
+- **Method 2: Using Java APIs**
+
+  The following code snippet shows only how to set the encoding and compression modes of a column family when creating a table. For complete code for creating a table and how to use the code to create a table, see "HBase Development Guide" > "Modifying a Table".
+
+  .. code-block::
+
+     TableDescriptorBuilder htd = TableDescriptorBuilder.newBuilder(TableName.valueOf("t1"));// Create a descriptor for table t1.
+     ColumnFamilyDescriptorBuilder hcd = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f1"));// Create a builder for column family f1.
+     hcd.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);// Set the encoding mode of column family f1 to FAST_DIFF.
+     hcd.setCompressionType(Compression.Algorithm.SNAPPY);// Set the compression algorithm of column family f1 to SNAPPY.
+     htd.setColumnFamily(hcd.build());// Add the column family f1 to the descriptor of table t1.
+
+**Setting or modifying the data block encoding mode and compression algorithm for an existing table**
+
+- **Method 1: Using hbase shell**
+
+  #. Log in to the node where the client is installed as the client installation user.
+
+  #. Run the following command to go to the client directory:
+
+     **cd /opt/client**
+
+  #. Run the following command to configure environment variables:
+
+     **source bigdata_env**
+
+  #. If the Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step:
+
+     **kinit** *Component service user*
+
+     For example, **kinit hbaseuser**.
+
+  #. Run the following HBase client command:
+
+     **hbase shell**
+
+  #. Run the following command to modify the table:
+
+     **alter '**\ *t1*\ **', {NAME => '**\ *f1*\ **', COMPRESSION => '**\ *SNAPPY*\ **', DATA_BLOCK_ENCODING => '**\ *FAST_DIFF*\ **'}**
+
+- **Method 2: Using Java APIs**
+
+  The following code snippet shows only how to modify the encoding and compression modes of a column family in an existing table. For complete code for modifying a table and how to use the code to modify a table, see "HBase Development Guide".
+
+  .. code-block::
+
+     TableDescriptor htd = admin.getDescriptor(TableName.valueOf("t1"));// Obtain the descriptor of table t1.
+     ColumnFamilyDescriptor originCF = htd.getColumnFamily(Bytes.toBytes("f1"));// Obtain the descriptor of column family f1.
+     ColumnFamilyDescriptorBuilder hcd = ColumnFamilyDescriptorBuilder.newBuilder(originCF);// Create a builder based on the existing column family attributes.
+     hcd.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);// Change the encoding mode of the column family to FAST_DIFF.
+     hcd.setCompressionType(Compression.Algorithm.SNAPPY);// Change the compression algorithm of the column family to SNAPPY.
+     admin.modifyColumnFamily(TableName.valueOf("t1"), hcd.build());// Submit to the server to modify the attributes of column family f1.
+
+  After the modification, the encoding and compression modes of the existing HFile will take effect after the next compaction.
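+For example, after running the **alter** statement, you can verify the column family attributes and manually trigger a major compaction so that existing HFiles are rewritten with the new settings. **t1** and **f1** are the sample names used above:
+
+.. code-block:: console
+
+   hbase(main):001:0> describe 't1'
+   hbase(main):002:0> major_compact 't1'
+
+In the **describe** output, the **f1** column family should show **COMPRESSION => 'SNAPPY'** and **DATA_BLOCK_ENCODING => 'FAST_DIFF'**.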
diff --git a/doc/component-operation-guide/source/using_hbase/configuring_hbase_dr.rst b/doc/component-operation-guide/source/using_hbase/configuring_hbase_dr.rst new file mode 100644 index 0000000..ac738b3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_hbase_dr.rst @@ -0,0 +1,306 @@ +:original_name: mrs_01_1609.html + +.. _mrs_01_1609: + +Configuring HBase DR +==================== + +Scenario +-------- + +HBase disaster recovery (DR), a key feature that is used to ensure high availability (HA) of the HBase cluster system, provides the real-time remote DR function for HBase. HBase DR provides basic O&M tools, including tools for maintaining and re-establishing DR relationships, verifying data, and querying data synchronization progress. To implement real-time DR, back up data of an HBase cluster to another HBase cluster. DR in the HBase table common data writing and BulkLoad batch data writing scenarios is supported. + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. + +Prerequisites +------------- + +- The active and standby clusters are successfully installed and started, and you have the administrator permissions on the clusters. + +- Ensure that the network connection between the active and standby clusters is normal and ports are available. +- If the active cluster is deployed in security mode and is not managed by one FusionInsight Manager, cross-cluster trust relationship has been configured for the active and standby clusters.. If the active cluster is deployed in normal mode, no cross-cluster mutual trust is required. +- Cross-cluster replication has been configured for the active and standby clusters. +- Time is consistent between the active and standby clusters and the NTP service on the active and standby clusters uses the same time source. +- Mapping relationships between the names of all hosts in the active and standby clusters and IP addresses have been configured in the hosts files of all the nodes in the active and standby clusters and of the node where the active cluster client resides. +- The network bandwidth between the active and standby clusters is determined based on service volume, which cannot be less than the possible maximum service volume. +- The MRS versions of the active and standby clusters must be the same. +- The scale of the standby cluster must be greater than or equal to that of the active cluster. + +Constraints +----------- + +- Although DR provides the real-time data replication function, the data synchronization progress is affected by many factors, such as the service volume in the active cluster and the health status of the standby cluster. In normal cases, the standby cluster should not take over services. In extreme cases, system maintenance personnel and other decision makers determine whether the standby cluster takes over services according to the current data synchronization indicators. + +- HBase clusters must be deployed in active/standby mode. +- Table-level operations on the DR table of the standby cluster are forbidden, such as modifying the table attributes and deleting the table. Misoperations on the standby cluster will cause data synchronization failure of the active cluster. As a result, table data in the standby cluster is lost. 
+- If the DR data synchronization function is enabled for HBase tables of the active cluster, the DR table structure of the standby cluster needs to be modified to ensure table structure consistency between the active and standby clusters during table structure modification. + +Procedure +--------- + +**Configuring the common data writing DR parameters for the active cluster** + +#. Log in to Manager of the active cluster. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations** and click **All Configurations**. The HBase configuration page is displayed. + +#. (Optional) :ref:`Table 1 ` describes the optional configuration items during HBase DR. You can set the parameters based on the description or use the default values. + + .. _mrs_01_1609__tcc2ebdc7794f4718bf8175f779496069: + + .. table:: **Table 1** Optional configuration items + + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Navigation Path | Parameter | Default Value | Description | + +============================+==============================================+===============+=========================================================================================================================================================================================================================================================================================================================================================+ + | HMaster > Performance | hbase.master.logcleaner.ttl | 600000 | Specifies the retention period of HLog. If the value is set to **604800000** (unit: millisecond), the retention period of HLog is 7 days. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hbase.master.cleaner.interval | 60000 | Interval for the HMaster to delete historical HLog files. The HLog that exceeds the configured period will be automatically deleted. You are advised to set it to the maximum value to save more HLogs. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | RegionServer > Replication | replication.source.size.capacity | 16777216 | Maximum size of edits, in bytes. If the edit size exceeds the value, HLog edits will be sent to the standby cluster. 
| + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.nb.capacity | 25000 | Maximum number of edits, which is another condition for triggering HLog edits to be sent to the standby cluster. After data in the active cluster is synchronized to the standby cluster, the active cluster reads and sends data in HLog according to this parameter value. This parameter is used together with **replication.source.size.capacity**. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.maxretriesmultiplier | 10 | Maximum number of retries when an exception occurs during replication. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.sleepforretries | 1000 | Retry interval (Unit: ms) | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hbase.regionserver.replication.handler.count | 6 | Number of replication RPC server instances on RegionServer | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +**Configuring the BulkLoad batch data writing DR parameters for the active cluster** + +4. Determine whether to enable the BulkLoad batch data writing DR function. + + If yes, go to :ref:`5 `. + + If no, go to :ref:`8 `. + +5. .. _mrs_01_1609__l4716d1d3802e4b24ba3b3b49cf396866: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations** and click **All Configurations**. The HBase configuration page is displayed. + +6. Search for **hbase.replication.bulkload.enabled** and change its value to **true** to enable the BulkLoad batch data writing DR function. + +7. 
Search for **hbase.replication.cluster.id** and change the HBase ID of the active cluster. The ID is used by the standby cluster to connect to the active cluster. The value can contain uppercase letters, lowercase letters, digits, and underscores (_), and cannot exceed 30 characters. + +**Restarting the HBase service and install the client** + +8. .. _mrs_01_1609__l3a38ddf2af1b455995b7223d0fe94c23: + + Click **Save**. In the displayed dialog box, click **OK**. Restart the HBase service. + +9. In the active and standby clusters, choose **Cluster** > **Name of the desired cluster** > **Service** > **HBase** > **More** > **Download Client** to download the client and install it. + +**Adding the DR relationship between the active and standby clusters** + +10. Log in as user **hbase** to the HBase shell page of the active cluster. + +11. Run the following command on HBase Shell to create the DR synchronization relationship between the active cluster HBase and the standby cluster HBase. + + **add_peer '**\ *Standby cluster ID*\ **', CLUSTER_KEY => "**\ *ZooKeeper service IP address in the standby cluster* **", CONFIG => {"hbase.regionserver.kerberos.principal" => "**\ *Standby cluster RegionServer principal*\ **", "hbase.master.kerberos.principal" => "**\ *Standby cluster HMaster principal*\ **"}** + + - The standby cluster ID indicates the ID for the active cluster to recognize the standby cluster. Enter an ID. The value can be specified randomly. Digits are recommended. + - The ZooKeeper address of the standby cluster includes the service IP address of ZooKeeper, the port for listening to client connections, and the HBase root directory of the standby cluster on ZooKeeper. + - Search for **hbase.master.kerberos.principal** and **hbase.regionserver.kerberos.principal** in the HBase **hbase-site.xml** configuration file of the standby cluster. + + For example, to add the DR relationship between the active and standby clusters, run the **add_peer '**\ *Standby cluster ID*\ **', CLUSTER_KEY => "192.168.40.2,192.168.40.3,192.168.40.4:24002:/hbase", CONFIG => {"hbase.regionserver.kerberos.principal" => "hbase/hadoop.hadoop.com@HADOOP.COM", "hbase.master.kerberos.principal" => "hbase/hadoop.hadoop.com@HADOOP.COM"}** + +12. (Optional) If the BulkLoad batch data write DR function is enabled, the HBase client configuration of the active cluster must be copied to the standby cluster. + + - Create the **/hbase/replicationConf/**\ **hbase.replication.cluster.id of the active cluster** directory in the HDFS of the standby cluster. + + - HBase client configuration file, which is copied to the **/hbase/replicationConf/hbase.replication.cluster.id of the active cluster** directory of the HDFS of the standby cluster. + + Example: **hdfs dfs -put HBase/hbase/conf/core-site.xml HBase/hbase/conf/hdfs-site.xml HBase/hbase/conf/yarn-site.xml hdfs://NameNode IP:25000/hbase/replicationConf/source_cluster** + +**Enabling HBase DR to synchronize data** + +13. Check whether a naming space exists in the HBase service instance of the standby cluster and the naming space has the same name as the naming space of the HBase table for which the DR function is to be enabled. + + - If the same namespace exists, go to :ref:`14 `. + - If no, create a naming space with the same name in the HBase shell of the standby cluster and go to :ref:`14 `. + +14. .. 
_mrs_01_1609__li254519151517: + + In the HBase shell of the active cluster, run the following command as user **hbase** to enable the real-time DR function for the table data of the active cluster to ensure that the data modified in the active cluster can be synchronized to the standby cluster in real time. + + You can only synchronize the data of one HTable at a time. + + **enable_table_replication '**\ *table name*\ **'** + + .. note:: + + - If the standby cluster does not contain a table with the same name as the table for which real-time synchronization is to be enabled, the table is automatically created. + - If a table with the same name as the table for which real-time synchronization is to be enabled exists in the standby cluster, the structures of the two tables must be the same. + - If the encryption algorithm SMS4 or AES is configured for '*Table name*', the function for synchronizing data from the active cluster to the standby cluster cannot be enabled for the HBase table. + - If the standby cluster is offline or has tables with the same name but different structures, the DR function cannot be enabled. + - If the DR data synchronization function is enabled for some Phoenix tables in the active cluster, the standby cluster cannot have common HBase tables with the same names as the Phoenix tables in the active cluster. Otherwise, the DR function fails to be enabled or the tables with the names in the standby cluster cannot be used properly. + - If the DR data synchronization function is enabled for Phoenix tables in the active cluster, you need to enable the DR data synchronization function for the metadata tables of the Phoenix tables. The metadata tables include SYSTEM.CATALOG, SYSTEM.FUNCTION, SYSTEM.SEQUENCE, and SYSTEM.STATS. + - If the DR data synchronization function is enabled for HBase tables of the active cluster, after adding new indexes to HBase tables, you need to manually add secondary indexes to DR tables in the standby cluster to ensure secondary index consistency between the active and standby clusters. + - The HBase multi-instance function also supports DR. You need to modify the parameters on the HBase service instance that corresponds to the standby cluster and run the commands on the clients of multiple instances. When adding the DR relationship, you need to select the directory, such as **hbase1**, for ZooKeeper of the standby cluster to store HBase multi-instance data. + +15. (Optional) If HBase does not use Ranger, run the following command as user **hbase** in the HBase shell of the active cluster to enable the real-time permission to control data DR function for the HBase tables in the active cluster. + + **enable_table_replication 'hbase:acl'** + +**Creating Users** + +16. Log in to FusionInsight Manager of the standby cluster, choose **System** > **Permission** > **Role** > **Create Role** to create a role, and add the same permission for the standby data table to the role based on the permission of the HBase source data table of the active cluster. +17. Choose **System** > **Permission** > **User** > **Create** to create a user. Set the **User Type** to **Human-Machine** or **Machine-Machine** based on service requirements and add the user to the created role. Access the HBase DR data of the standby cluster as the newly created user. + + .. note:: + + - After the permission of the active HBase source data table is modified, to ensure that the standby cluster can properly read data, modify the role permission for the standby cluster. 
+ - If the current component uses Ranger for permission control, you need to configure permission management policies based on Ranger. For details, see :ref:`Adding a Ranger Access Permission Policy for HBase `. + +**Synchronizing the table data of the active cluster** + +18. After HBase DR is configured and data synchronization is enabled, check whether tables and data exist in the active cluster and whether the historical data needs to be synchronized to the standby cluster. + + - If yes, a table exists and data needs to be synchronized. Log in as the HBase table user to the node where the HBase client of the active cluster is installed and run the kinit username to authenticate the identity. The user must have the read and write permissions on tables and the execute permission on the **hbase:meta** table. Then go to :ref:`19 `. + - If no, no further action is required. + +19. .. _mrs_01_1609__li2511113725912: + + The HBase DR configuration does not support automatic synchronization of historical data in tables. You need to back up the historical data of the active cluster and then manually restore the historical data in the standby cluster. + + Manual recovery refers to the recovery of a single table, which can be performed through Export, DistCp, or Import. + + To manually recover a single table, perform the following steps: + + a. Export table data from the active cluster. + + **hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true** *Table name* *Directory where the source data is stored* + + Example: **hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true t1 /user/hbase/t1** + + b. Copy the data that has been exported to the standby cluster. + + **hadoop distcp** *directory where the source data is stored on the active cluster* **hdfs://**\ *ActiveNameNodeIP:8020/directory where the source data is stored on the standby cluster* + + **ActiveNameNodeIP** indicates the IP address of the active NameNode in the standby cluster. + + Example: **hadoop distcp /user/hbase/t1 hdfs://192.168.40.2:8020/user/hbase/t1** + + c. Import data to the standby cluster as the HBase table user of the standby cluster. + + On the HBase shell screen of the standby cluster, run the following command as user **hbase** to retain the data writing status: + + **set_clusterState_active** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_active + => true + + **hbase org.apache.hadoop.hbase.mapreduce.Import** *-Dimport.bulk.output=Directory where the output data is stored in the standby cluster Table name Directory where the source data is stored in the standby cluster* + + **hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles** *Directory where the output data is stored in the standby cluster Table name* + + Example: + + .. code-block:: + + hbase(main):001:0> set_clusterState_active + => true + + **hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/user/hbase/output_t1 t1 /user/hbase/t1** + + **hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/output_t1 t1** + +20. Run the following command on the HBase client to check the synchronized data of the active and standby clusters. After the DR data synchronization function is enabled, you can run this command to check whether the newly synchronized data is consistent. 
+ + **hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime**\ *=Start time* **--endtime**\ *=End time* *Column family name ID of the standby cluster Table name* + + .. note:: + + - The start time must be earlier than the end time. + - The values of **starttime** and **endtime** must be in the timestamp format. You need to run **date -d "2015-09-30 00:00:00" +%s to** change a common time format to a timestamp format. + +**Specify the data writing status for the active and standby clusters.** + +21. On the HBase shell screen of the active cluster, run the following command as user **hbase** to retain the data writing status: + + **set_clusterState_active** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_active + => true + +22. On the HBase shell screen of the standby cluster, run the following command as user **hbase** to retain the data read-only status: + + **set_clusterState_standby** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_standby + => true + +Related Commands +---------------- + +.. table:: **Table 2** HBase DR + + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Command | Description | + +=================================================================================+================================================================================================================================================================================================================================================================================+=======================================================================================================================================================================================================================================================================================================================+ + | Set up a DR relationship. | add_peer'*Standby cluster ID*', CLUSTER_KEY => "*Standby cluster ZooKeeper service IP address*", CONFIG => {"hbase.regionserver.kerberos.principal" => "*Standby cluster RegionServer principal*", "hbase.master.kerberos.principal" => "*Standby cluster HMaster principal*"} | Set up the relationship between the active cluster and the standby cluster. | + | | | | + | | **add_peer '1','zk1,zk2,zk3:2181:/hbase1'** | If BulkLoad batch data write DR is enabled: | + | | | | + | | **2181**: port number of ZooKeeper in the cluster | - Create the **/hbase/replicationConf/hbase.replication.cluster.id of the active cluster** directory in the HDFS of the standby cluster. | + | | | - HBase client configuration file, which is copied to the **/hbase/replicationConf/hbase.replication.cluster.id of the active cluster** directory of the HDFS of the standby cluster. 
| + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Remove the DR relationship. | **remove_peer** *'Standby cluster ID'* | Remove standby cluster information from the active cluster. | + | | | | + | | Example: | | + | | | | + | | **remove_peer '1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Querying the DR Relationship | **list_peers** | Query standby cluster information (mainly Zookeeper information) in the active cluster. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Enable the real-time user table synchronization function. | **enable_table_replication** *'Table name'* | Synchronize user tables from the active cluster to the standby cluster. | + | | | | + | | Example: | | + | | | | + | | **enable_table_replication 't1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Disable the real-time user table synchronization function. | **disable_table_replication** *'Table name'* | Do not synchronize user tables from the active cluster to the standby cluster. 
| + | | | | + | | Example: | | + | | | | + | | **disable_table_replication 't1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Verify data of the active and standby clusters. | **bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=**\ *Start time* **--endtime=**\ *End time* *Column family name Standby cluster ID Table name* | Verify whether data of the specified table is the same between the active cluster and the standby cluster. | + | | | | + | | | The description of the parameters in this command is as follows: | + | | | | + | | | - Start time: If start time is not specified, the default value **0** will be used. | + | | | - End time: If end time is not specified, the time when the current operation is submitted will be used by default. | + | | | - Table name: If a table name is not entered, all user tables for which the real-time synchronization function is enabled will be verified by default. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Switch the data writing status. | **set_clusterState_active** | Specifies whether data can be written to the cluster HBase tables. | + | | | | + | | **set_clusterState_standby** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add or update the active cluster HDFS configurations saved in the peer cluster. | **hdfs dfs -put -f HBase/hbase/conf/core-site.xml HBase/hbase/conf/hdfs-site.xml HBase/hbase/conf/yarn-site.xml hdfs://**\ *Standby cluster* **NameNode** **IP:PORT/hbase/replicationConf/**\ *Active cluster*\ **hbase.replication.cluster.id** | Enable DR for data including bulkload data. 
When HDFS parameters are modified in the active cluster, the modification cannot be automatically synchronized from the active cluster to the standby cluster. You need to manually run the command to synchronize configuration. The affected parameters are as follows: | + | | | | + | | | - fs.defaultFS | + | | | - dfs.client.failover.proxy.provider.hacluster | + | | | - dfs.client.failover.connection.retries.on.timeouts | + | | | - dfs.client.failover.connection.retries | + | | | | + | | | For example, change **fs.defaultFS** to **hdfs://hacluster_sale**, | + | | | | + | | | HBase client configuration file, which is copied to the **/hbase/replicationConf/hbase.replication.cluster.id of the active cluster** directory of the HDFS of the standby cluster. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/configuring_hbase_parameters.rst b/doc/component-operation-guide/source/using_hbase/configuring_hbase_parameters.rst new file mode 100644 index 0000000..d37aca1 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_hbase_parameters.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0443.html + +.. _mrs_01_0443: + +Configuring HBase Parameters +============================ + +.. note:: + + The operations described in this section apply only to clusters of versions earlier than MRS 3.x. + +If the default parameter settings of the MRS service cannot meet your requirements, you can modify the parameter settings as required. + +#. Log in to the service page. + + For versions earlier than MRS 1.9.2: Log in to `MRS Manager `__, and choose **Services**. + + For MRS 1.9.2 or later: Click the cluster name on the MRS console and choose **Components**. + +#. Choose **HBase** > **Service Configuration** and switch **Basic** to **All**. On the displayed HBase configuration page, modify parameter settings. + + .. table:: **Table 1** HBase parameters + + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | Parameter | Description | Value | + +=======================================+=======================================================================================================================================================================================================================================================================+=================================+ + | hbase.regionserver.hfile.durable.sync | Whether to enable the HFile durability to make data persistence on disks. 
If this parameter is set to **true**, HBase performance is affected because each HFile is synchronized to disks by **hadoop fsync** when being written to HBase. | Possible values are as follows: | + | | | | + | | This parameter exists only in MRS 1.9.2 or earlier. | - **true** | + | | | - **false** | + | | | | + | | | The default value is **true**. | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ + | hbase.regionserver.wal.durable.sync | Specifies whether to enable WAL file durability to make the WAL data persistence on disks. If this parameter is set to **true**, HBase performance is affected because each edited WAL file is synchronized to disks by **hadoop fsync** when being written to HBase. | Possible values are as follows: | + | | | | + | | This parameter exists only in MRS 1.9.2 or earlier. | - **true** | + | | | - **false** | + | | | | + | | | The default value is **true**. | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/configuring_hbase_replication.rst b/doc/component-operation-guide/source/using_hbase/configuring_hbase_replication.rst new file mode 100644 index 0000000..4700de3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_hbase_replication.rst @@ -0,0 +1,404 @@ +:original_name: mrs_01_0501.html + +.. _mrs_01_0501: + +Configuring HBase Replication +============================= + +Scenario +-------- + +As a key feature to ensure high availability of the HBase cluster system, HBase cluster replication provides HBase with remote data replication in real time. It provides basic O&M tools, including tools for maintaining and re-establishing active/standby relationships, verifying data, and querying data synchronization progress. To achieve real-time data replication, you can replicate data from the HBase cluster to another one. + +Prerequisites +------------- + +- The active and standby clusters have been successfully installed and started (the cluster status is **Running** on the **Active Clusters** page), and you have the administrator rights of the clusters. + +- The network between the active and standby clusters is normal and ports can be used properly. +- Cross-cluster mutual trust has been configured. For details, see `Configuring Cross-Cluster Mutual Trust Relationships `__. +- If historical data exists in the active cluster and needs to be synchronized to the standby cluster, cross-cluster replication must be configured for the active and standby clusters. For details, see :ref:`Enabling Cross-Cluster Copy `. +- Time is consistent between the active and standby clusters and the Network Time Protocol (NTP) service on the active and standby clusters uses the same time source. +- Mapping relationships between the names of all hosts in the active and standby clusters and service IP addresses have been configured in the **/etc/hosts** file by appending **192.***.***.**\* host1** to the **hosts** file. 
+- The network bandwidth between the active and standby clusters is determined based on service volume, which cannot be less than the possible maximum service volume. + +Constraints +----------- + +- Despite that HBase cluster replication provides the real-time data replication function, the data synchronization progress is determined by several factors, such as the service loads in the active cluster and the health status of processes in the standby cluster. In normal cases, the standby cluster should not take over services. In extreme cases, system maintenance personnel and other decision makers determine whether the standby cluster takes over services according to the current data synchronization indicators. + +- Currently, the replication function supports only one active cluster and one standby cluster in HBase. +- Typically, do not perform operations on data synchronization tables in the standby cluster, such as modifying table properties or deleting tables. If any misoperation on the standby cluster occurs, data synchronization between the active and standby clusters will fail and data of the corresponding table in the standby cluster will be lost. +- If the replication function of HBase tables in the active cluster is enabled for data synchronization, after modifying the structure of a table in the active cluster, you need to manually modify the structure of the corresponding table in the standby cluster to ensure table structure consistency. + +Procedure +--------- + +**Enable the replication function for the active cluster to synchronize data written by Put.** + +#. .. _mrs_01_0501__li155891430132615: + + Log in to the service page. + + For versions earlier than MRS 1.9.2: Log in to `MRS Manager `__, and choose **Services**. + + For MRS 1.9.2 or later: Click the cluster name on the MRS console and choose **Components**. + +#. Go to the **All Configurations** page of the HBase service. For details, see :ref:`Modifying Cluster Service Configuration Parameters `. + + .. note:: + + For clusters of MRS 1.9.2 or later: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +#. Choose **RegionServer** > **Replication** and check whether the value of **hbase.replication** is **true**. If the value is **false**, set **hbase.replication** to **true**. + + .. note:: + + In MRS 2.\ *x*, this configuration has been removed. Skip this step. + +#. (Optional) Set configuration items listed in :ref:`Table 1 `. You can set the parameters based on the description or use the default values. + + .. _mrs_01_0501__table6909942154955: + + .. 
table:: **Table 1** Optional configuration items + + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Navigation Path | Parameter | Default Value | Description | + +============================+==============================================+===============+=========================================================================================================================================================================================================================================================================================================================================================+ + | HMaster > Performance | hbase.master.logcleaner.ttl | 600000 | Time to live (TTL) of HLog files. If the value is set to **604800000** (unit: millisecond), the retention period of HLog is 7 days. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hbase.master.cleaner.interval | 60000 | Interval for the HMaster to delete historical HLog files. The HLog that exceeds the configured period will be automatically deleted. You are advised to set it to the maximum value to save more HLogs. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | RegionServer > Replication | replication.source.size.capacity | 16777216 | Maximum size of edits, in bytes. If the edit size exceeds the value, HLog edits will be sent to the standby cluster. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.nb.capacity | 25000 | Maximum number of edits, which is another condition for triggering HLog edits to be sent to the standby cluster. After data in the active cluster is synchronized to the standby cluster, the active cluster reads and sends data in HLog according to this parameter value. This parameter is used together with **replication.source.size.capacity**. 
| + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.maxretriesmultiplier | 10 | Maximum number of retries when an exception occurs during replication. | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | replication.source.sleepforretries | 1000 | Retry interval (unit: ms) | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hbase.regionserver.replication.handler.count | 6 | Number of replication RPC server instances on RegionServer | + +----------------------------+----------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +**Enable the replication function for the active cluster to synchronize data written by bulkload.** + +5. .. _mrs_01_0501__li65160752154955: + + Determine whether to enable bulkload replication. + + .. note:: + + If bulkload import is used and data needs to be synchronized, you need to enable Bulkload replication. + + If yes, go to :ref:`6 `. + + If no, go to :ref:`10 `. + +6. .. _mrs_01_0501__li57688977154955: + + Go to the **All Configurations** page of the HBase service parameters by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +7. On the HBase configuration interface of the active and standby clusters, search for **hbase.replication.cluster.id** and modify it. It specifies the HBase ID of the active and standby clusters. For example, the HBase ID of the active cluster is set to **replication1** and the HBase ID of the standby cluster is set to **replication2** for connecting the active cluster to the standby cluster. To save data overhead, the parameter value length is not recommended to exceed 30. + +8. .. _mrs_01_0501__li3244131341713: + + On the HBase configuration interface of the standby cluster, search for **hbase.replication.conf.dir** and modify it. It specifies the HBase configurations of the active cluster client used by the standby cluster and is used for data replication when the bulkload data replication function is enabled. The parameter value is a path name, for example, **/home**. 
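+
+   The note below describes how to manually copy the active cluster's HBase client configuration files to every RegionServer node in the standby cluster. The following is a minimal sketch of that copy for the example values **/home** and **replication1**; the host name **active-node** and the client path **/opt/client** are placeholders that depend on your environment.
+
+   .. code-block::
+
+      # Run on each RegionServer node of the standby cluster.
+      mkdir -p /home/replication1
+      # Copy the three client configuration files from a node in the active cluster.
+      scp active-node:/opt/client/HBase/hbase/conf/core-site.xml /home/replication1/
+      scp active-node:/opt/client/HBase/hbase/conf/hdfs-site.xml /home/replication1/
+      scp active-node:/opt/client/HBase/hbase/conf/hbase-site.xml /home/replication1/
+      # Adjust the directory and file ownership as required.
+      chown -R omm:wheel /home/replication1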
+ +   .. note:: + +      - In versions earlier than MRS 3.x, you do not need to set this parameter. Skip :ref:`8 `. +      - When bulkload replication is enabled, you need to manually place the HBase client configuration files (**core-site.xml**, **hdfs-site.xml**, and **hbase-site.xml**) of the active cluster on all RegionServer nodes in the standby cluster. The actual path for placing the configuration files is **${hbase.replication.conf.dir}/${hbase.replication.cluster.id}**. For example, if **hbase.replication.conf.dir** of the standby cluster is set to **/home** and **hbase.replication.cluster.id** of the active cluster is set to **replication1**, the actual path for placing the configuration files in the standby cluster is **/home/replication1**. You also need to change the corresponding directory and file permissions by running the **chown -R omm:wheel /home/replication1** command. +      - You can obtain the client configuration files from the client in the active cluster, for example, from the **/opt/client/HBase/hbase/conf** path. For details about how to update the configuration file, see `Updating a Client `__. + +9. On the HBase configuration page of the active cluster, search for and change the value of **hbase.replication.bulkload.enabled** to **true** to enable bulkload replication. + +**Restart the HBase service and install the client.** + +10. .. _mrs_01_0501__li6210082154955: + +    Save the configurations and restart HBase. + +11. .. _mrs_01_0501__li11385192216347: + +    In both the active and standby clusters, choose **Cluster** > **Dashboard** > **More** > **Download Client** to download and install the client. For details about how to update the client configuration file, see `Updating a Client `__. + +**Synchronize table data of the active cluster. (Skip this step if the active cluster has no data.)** + +12. .. _mrs_01_0501__li12641483154955: + +    Access the HBase shell of the active cluster as user **hbase**. + +    a. On the active management node where the client has been updated, run the following command to go to the client directory: + +       **cd /opt/client** + +    b. Run the following command to configure environment variables: + +       **source bigdata_env** + +    c. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. If Kerberos authentication is disabled for the current cluster, skip this step. + +       **kinit hbase** + +       .. note:: + +          The system prompts you to enter the password after you run **kinit** **hbase**. The default password of user **hbase** is **Hbase@123**. + +    d. Run the following HBase client command: + +       **hbase shell** + +13. Check whether historical data exists in the standby cluster. If historical data exists and data in the active and standby clusters must be consistent, delete data from the standby cluster first. + +    a. On the HBase shell of the standby cluster, run the **list** command to view the existing tables in the standby cluster. + +    b. Delete data tables from the standby cluster based on the output list. + +       **disable** '*tableName*' + +       **drop** '*tableName*' + +14. After HBase replication is configured and data synchronization is enabled, check whether tables and data exist in the active cluster and whether the historical data needs to be synchronized to the standby cluster.
+ + Run the **list** command to check the existing tables in the active cluster and run the **scan** '*tableName*\ **'** command to check whether the tables contain historical data. + + - If tables exist and data needs to be synchronized, go to :ref:`15 `. + - If no, no further action is required. + +15. .. _mrs_01_0501__li4226821210491: + + The HBase replication configuration does not support automatic synchronization of historical data in tables. You need to back up the historical data of the active cluster and then manually synchronize the historical data to the standby cluster. + + Manual synchronization refers to the synchronization of a single table that is implemented by Export, distcp, and Import. + + The process for manually synchronizing data of a single table is as follows: + + a. Export table data from the active cluster. + + **hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true** *Table name* *Directory where the source data is stored* + + Example: **hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true t1 /user/hbase/t1** + + b. Copy the data that has been exported to the standby cluster. + + **hadoop distcp** *Directory for storing source data in the active cluster* **hdfs://**\ *ActiveNameNodeIP*:**9820/** *Directory for storing source data in the standby cluster* + + **ActiveNameNodeIP** indicates the IP address of the active NameNode in the standby cluster. + + Example: **hadoop distcp /user/hbase/t1 hdfs://192.168.40.2:9820/user/hbase/t1** + + .. note:: + + In MRS 1.6.2 and earlier versions, the default port number is 25000. For details, see `List of Open Source Component Ports `__. + + c. Import data to the standby cluster as the HBase table user of the standby cluster. + + **hbase org.apache.hadoop.hbase.mapreduce.Import** *-Dimport.bulk.output=Directory where the output data is stored in the standby cluster Table name Directory where the source data is stored in the standby cluster* + + **hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles** *Directory where the output data is stored in the standby cluster Table name* + + For example, **hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/user/hbase/output_t1 t1 /user/hbase/t1** and + + **hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/output_t1 t1** + +**Add the replication relationship between the active and standby clusters.** + +16. .. _mrs_01_0501__li46664485154955: + + Run the following command on the HBase Shell to create the replication synchronization relationship between the active cluster and the standby cluster: + + **add_peer** '*Standby cluster ID',* *CLUSTER_KEY =>* '*ZooKeeper address of the standby cluster*',\ **{HDFS_CONFS => true}** + + - *Standby cluster ID* indicates an ID for the active cluster to recognize the standby cluster. It is recommended that the ID contain letters and digits. + + - The ZooKeeper address of the standby cluster includes the service IP address of ZooKeeper, the port for listening to client connections, and the HBase root directory of the standby cluster on ZooKeeper. + + - **{HDFS_CONFS => true}** indicates that the default HDFS configuration of the active cluster will be synchronized to the standby cluster. This parameter is used for HBase of the standby cluster to access HDFS of the active cluster. If bulkload replication is disabled, you do not need to use this parameter. 
+ + Suppose the standby cluster ID is replication2 and the ZooKeeper address of the standby cluster is **192.168.40.2,192.168.40.3,192.168.40.4:2181:/hbase**. + + - For versions later than MRS 1.9.2: Run the **add_peer** **'replication2',\ CLUSTER_KEY =>** **'192.168.40.2,192.168.40.3,192.168.40.4:2181:/hbase',CONFIG => { "hbase.regionserver.kerberos.principal" => "", "hbase.master.kerberos.principal" => "" }** command for a security cluster and the **add_peer** **'replication2',\ CLUSTER_KEY =>** **'192.168.40.2,192.168.40.3,192.168.40.4:2181:/hbase'** command for a common cluster. + + The **hbase.master.kerberos.principal** and **hbase.regionserver.kerberos.principal** parameters are the Kerberos users of HBase in the security cluster. You can search the **hbase-site.xml** file on the client for the parameter values. For example, if the client is installed in the **/opt/client** directory of the Master node, you can run the **grep "kerberos.principal" /opt/client/HBase/hbase/conf/hbase-site.xml -A1** command to obtain the principal of HBase. See the following figure. + + + .. figure:: /_static/images/en-us_image_0000001295770664.png + :alt: **Figure 1** Obtaining the principal of HBase + + **Figure 1** Obtaining the principal of HBase + + - For MRS 1.9.2 or earlier: Run the **add_peer** **'replication2',\ CLUSTER_KEY =>** **'192.168.40.2,192.168.40.3,192.168.40.4:2181:/hbase'** command. + + .. note:: + + a. Obtain the ZooKeeper service IP address. + + For versions earlier than MRS 1.9.2: Choose **Services** > **ZooKeeper** > **Instance** to obtain the service IP address of ZooKeeper. + + For MRS 1.9.2 or later: Log in to the MRS console, click the cluster name, and choose **Components** > **ZooKeeper** > **Instances** to obtain the ZooKeeper service IP address. + + b. On the ZooKeeper service parameter configuration page, search for clientPort, which is the port for the client to connect to the server. + + c. Run the **list_peers** command to check whether the replication relationship between the active and standby clusters is added. If the following information is displayed, the relationship is successfully added. + + .. code-block:: + + hbase(main):003:0> list_peers + PEER_ID CLUSTER_KEY ENDPOINT_CLASSNAME STATE REPLICATE_ALL NAMESPACES TABLE_CFS BANDWIDTH SERIAL + replication2 192.168.0.13,192.168.0.177,192.168.0.25:2181:/hbase ENABLED true 0 false + + For versions earlier than MRS 1.9.2: If the following information is displayed after you run the **list_peers** command, the operation is successful. + + .. code-block:: + + hbase(main):003:0> list_peers + PEER_ID CLUSTER_KEY STATE TABLE_CFS + replication2 192.168.0.13,192.168.0.177,192.168.0.25:2181:/hbase ENABLED + +**Specify the data writing status for the active and standby clusters.** + +17. On the HBase shell of the active cluster, run the following command to retain the data writing status: + + **set_clusterState_active** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_active + => true + +18. On the HBase shell of the standby cluster, run the following command to retain the data read-only status: + + **set_clusterState_standby** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_standby + => true + +**Enable the HBase replication function to synchronize data.** + +19. 
Check whether the HBase service instance of the standby cluster contains a namespace with the same name as the namespace of the HBase table for which the replication function is to be enabled. + +    On the HBase shell of the standby cluster, run the **list_namespace** command to query the namespace. + +    - If the same namespace exists, go to :ref:`20 `. + +    - If the same namespace does not exist, on the HBase shell of the standby cluster, run the following command to create a namespace with the same name and go to :ref:`20 `: + +      **create_namespace 'ns1'** + +20. .. _mrs_01_0501__li15192291154955: + +    On the HBase shell of the active cluster, run the following command to enable real-time replication for tables in the active cluster. This ensures that modified data in the active cluster can be synchronized to the standby cluster in real time. + +    You can synchronize data of only one HTable at a time. + +    **enable_table_replication '**\ *Table name*' + +    .. note:: + +       - If the standby cluster does not contain a table with the same name as the table for which real-time synchronization is to be enabled, the table is automatically created. + +       - If a table with the same name as the table for which real-time synchronization is to be enabled exists in the standby cluster, the structures of the two tables must be the same. + +       - If the encryption algorithm SMS4 or AES is configured for '*Table name*', the function for synchronizing data from the active cluster to the standby cluster cannot be enabled for the HBase table. + +       - If the standby cluster is offline or has tables with the same name but different structures, the replication function cannot be enabled. + +         If the standby cluster is offline, start it. + +         If the standby cluster has a table with the same name but a different structure, run the **alter** command on the HBase shell of the standby cluster to modify the table structure so that it is the same as the table structure of the active cluster. + +21. .. _mrs_01_0501__li3638114154955: + +    On the HBase shell of the active cluster, run the following command to enable the real-time replication function for the active cluster to synchronize the HBase permission table: + +    **enable_table_replication 'hbase:acl'** + +    .. note:: + +       After the permissions on the HBase source data tables in the active cluster are modified, modify the role permissions in the standby cluster accordingly so that the standby cluster can read data properly. + +**Check the data synchronization status for the active and standby clusters.** + +22. Run the following command on the HBase client to check the synchronized data of the active and standby clusters. After the replication function is enabled, you can run this command to check whether the newly synchronized data is consistent. + +    **hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime**\ *=Start time* **--endtime**\ *=End time* *Column family name ID of the standby cluster Table name* + +    .. note:: + +       - The start time must be earlier than the end time. +       - The values of **starttime** and **endtime** must be in the timestamp format. You need to run the **date -d "2015-09-30 00:00:00" +%s** command to change a common time format to a timestamp format. The command output is a 10-digit number (accurate to the second), but HBase uses 13-digit timestamps (accurate to the millisecond). Therefore, you need to add three zeros (000) to the end of the command output.
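+
+    The following is a minimal sketch of the time conversion described in the note above, combined with an example verification run. The time values, the standby cluster ID **replication2**, and the table name **t1** are example values only; replace them with your own.
+
+    .. code-block::
+
+       # Convert common time formats to timestamps and append three zeros
+       # to obtain the 13-digit millisecond values expected by HBase.
+       START=$(date -d "2015-09-30 00:00:00" +%s)000
+       END=$(date -d "2015-10-01 00:00:00" +%s)000
+
+       # Verify table t1 against the standby cluster whose ID is replication2.
+       hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication \
+         --starttime=$START --endtime=$END replication2 t1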
+ + **Switch over active and standby clusters.** + + .. note:: + + a. If the standby cluster needs to be switched over to the active cluster, reconfigure the active/standby relationship by referring to :ref:`1 ` to :ref:`11 ` and :ref:`16 ` to :ref:`21 `. + b. Do not perform :ref:`12 ` to :ref:`15 `. + +Related Commands +---------------- + +.. table:: **Table 2** HBase replication + + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Operation | Command | Description | + +=================================================================================+========================================================================================================================================================+=====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Set up the active/standby relationship. | **add_peer** *'Standby cluster ID', 'Standby cluster address'* | Set up the relationship between the active cluster and the standby cluster. To enable bulkload replication, run the **add_peer** *'Standby cluster ID'*\ **,\ CLUSTER_KEY =>** *'Standby cluster address'* command, configure **hbase.replication.conf.dir**, and manually copy the HBase client configuration file in the active cluster to all RegionServer nodes in the standby cluster. For details, see :ref:`5 ` to :ref:`11 `. | + | | | | + | | Examples: | For MRS 1.9.2 or earlier, to enable bulkload replication, run the following command: **add_peer** *'Standby cluster ID',\ 'Standby cluster address'*,\ **{HDFS_CONF => true}**. | + | | | | + | | **add_peer '1', 'zk1,zk2,zk3:2181:/hbase'** | | + | | | | + | | **add_peer '1', 'zk1,zk2,zk3:2181:/hbase1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Remove the active/standby relationship. 
| **remove_peer** *'Standby cluster ID'* | Remove standby cluster information from the active cluster. | + | | | | + | | Example: | | + | | | | + | | **remove_peer '1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Query the active/standby relationship. | **list_peers** | Query standby cluster information (mainly Zookeeper information) in the active cluster. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Enable the real-time user table synchronization function. | **enable_table_replication** *'Table name'* | Synchronize user tables from the active cluster to the standby cluster. | + | | | | + | | Example: | | + | | | | + | | **enable_table_replication 't1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Disable the real-time user table synchronization function. | **disable_table_replication** *'Table name'* | Do not synchronize user tables from the active cluster to the standby cluster. 
| + | | | | + | | Example: | | + | | | | + | | **disable_table_replication 't1'** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Verify data of the active and standby clusters. | **bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication** *--starttime --endtime Column family name Standby cluster ID Table name* | Verify whether data of the specified table is the same between the active cluster and the standby cluster. | + | | | | + | | | The description of the parameters in this command is as follows: | + | | | | + | | | - Start time: If start time is not specified, the default value **0** will be used. | + | | | - End time: If end time is not specified, the time when the current operation is submitted will be used by default. | + | | | - Table name: If a table name is not entered, all user tables for which the real-time synchronization function is enabled will be verified by default. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Switch the data writing status. | **set_clusterState_active** | Specifies whether data can be written to the cluster HBase tables. | + | | | | + | | **set_clusterState_standby** | | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add or update the active cluster HDFS configurations saved in the peer cluster. | **set_replication_hdfs_confs 'PeerId', {'key1' => 'value1', 'key2' => 'value2'}** | Enable replication for data including bulkload data. 
When HDFS parameters are modified in the active cluster, the modification cannot be automatically synchronized to the standby cluster. You need to manually run the command to synchronize the changes. The affected parameters are as follows: | + | | | | + | | | - fs.defaultFS | + | | | - dfs.client.failover.proxy.provider.hacluster | + | | | - dfs.client.failover.connection.retries.on.timeouts | + | | | - dfs.client.failover.connection.retries | + | | | | + | | | For example, if the value of **fs.defaultFS** is changed to **hdfs://hacluster_sale**, run the **set_replication_hdfs_confs '1', {'fs.defaultFS' => 'hdfs://hacluster_sale'}** command to synchronization the HDFS configuration to the standby cluster whose ID is 1. | + | | | | + | | | In versions later than MRS 1.9.2, this command has been removed. If synchronization is required, manually copy the changed client configurations in the active cluster to the standby cluster. For details, see :ref:`8 `. | + +---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/configuring_region_in_transition_recovery_chore_service.rst b/doc/component-operation-guide/source/using_hbase/configuring_region_in_transition_recovery_chore_service.rst new file mode 100644 index 0000000..78ff4ae --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_region_in_transition_recovery_chore_service.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_1010.html + +.. _mrs_01_1010: + +Configuring Region In Transition Recovery Chore Service +======================================================= + +Scenario +-------- + +In a faulty environment, there are possibilities that a region may be stuck in transition for longer duration due to various reasons like slow region server response, unstable network, ZooKeeper node version mismatch. During region transition, client operation may not work properly as some regions will not be available. + +Configuration +------------- + +A chore service should be scheduled at HMaster to identify and recover regions that stay in the transition state for a long time. + +The following table describes the parameters for enabling this function. + +.. table:: **Table 1** Parameters + + +-----------------------------------------------+-----------------------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +===============================================+===============================================================================================+===============+ + | hbase.region.assignment.auto.recovery.enabled | Configuration parameter used to enable/disable the region assignment recovery thread feature. 
| true | + +-----------------------------------------------+-----------------------------------------------------------------------------------------------+---------------+ diff --git a/doc/component-operation-guide/source/using_hbase/configuring_secure_hbase_replication.rst b/doc/component-operation-guide/source/using_hbase/configuring_secure_hbase_replication.rst new file mode 100644 index 0000000..839f7c0 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_secure_hbase_replication.rst @@ -0,0 +1,69 @@ +:original_name: mrs_01_1009.html + +.. _mrs_01_1009: + +Configuring Secure HBase Replication +==================================== + +Scenario +-------- + +This topic provides the procedure to configure the secure HBase replication during cross-realm Kerberos setup in security mode. + +Prerequisites +------------- + +- Mapping for all the FQDNs to their realms should be defined in the Kerberos configuration file. +- The passwords and keytab files of **ONE.COM** and **TWO.COM** must be the same. + +Procedure +--------- + +#. Create krbtgt principals for the two realms. + + For example, if you have two realms called **ONE.COM** and **TWO.COM**, you need to add the following principals: **krbtgt/ONE.COM@TWO.COM** and **krbtgt/TWO.COM@ONE.COM**. + + Add these two principals at both realms. + + .. code-block:: + + kadmin: addprinc -e "" krbtgt/ONE.COM@TWO.COM + kadmin: addprinc -e "" krbtgt/TWO.COM@ONE.COM + + .. note:: + + There must be at least one common keytab mode between these two realms. + +#. Add rules for creating short names in Zookeeper. + + **Dzookeeper.security.auth_to_local** is a parameter of the ZooKeeper server process. Following is an example rule that illustrates how to add support for the realm called **ONE.COM**. The principal has two members (such as **service/instance@ONE.COM**). + + .. code-block:: + + Dzookeeper.security.auth_to_local=RULE:[2:\$1@\$0](.*@\\QONE.COM\\E$)s/@\\QONE.COM\\E$//DEFAULT + + The above code example adds support for the **ONE.COM** realm in a different realm. Therefore, in the case of replication, you must add a rule for the master cluster realm in the slave cluster realm. **DEFAULT** is for defining the default rule. + +#. Add rules for creating short names in the Hadoop processes. + + The following is the **hadoop.security.auth_to_local** property in the **core-site.xml** file in the slave cluster HBase processes. For example, to add support for the **ONE.COM** realm: + + .. code-block:: + + + hadoop.security.auth_to_local + RULE:[2:$1@$0](.*@\QONE.COM\E$)s/@\QONE.COM\E$//DEFAULT + + + .. note:: + + If replication for bulkload data is enabled, then the same property for supporting the slave realm needs to be added in the **core-site.xml** file in the master cluster HBase processes. + + Example: + + .. code-block:: + + + hadoop.security.auth_to_local + RULE:[2:$1@$0](.*@\QTWO.COM\E$)s/@\QTWO.COM\E$//DEFAULT + diff --git a/doc/component-operation-guide/source/using_hbase/configuring_the_mob.rst b/doc/component-operation-guide/source/using_hbase/configuring_the_mob.rst new file mode 100644 index 0000000..6e44a36 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/configuring_the_mob.rst @@ -0,0 +1,80 @@ +:original_name: mrs_01_1631.html + +.. _mrs_01_1631: + +Configuring the MOB +=================== + +Scenario +-------- + +In the actual application scenario, data in various sizes needs to be stored, for example, image data and documents. Data whose size is smaller than 10 MB can be stored in HBase. 
HBase can yield the best read-and-write performance for data whose size is smaller than 100 KB. If the size of data stored in HBase is greater than 100 KB or even reaches 10 MB and the same number of data files are inserted, the total data amount is large, causing frequent compaction and split, high CPU consumption, high disk I/O frequency, and low performance. + +MOB data (100 KB to 10 MB data) is stored in a file system (such as the HDFS) in the HFile format. Files are centrally managed using the expiredMobFileCleaner and Sweeper tools. The addresses and size of files are stored in the HBase store as values. This greatly decreases the compaction and split frequency in HBase and improves performance. + +The MOB function of HBase is enabled by default. For details about related configuration items, see :ref:`Table 1 `. To use the MOB function, you need to specify the MOB mode for storing data in the specified column family when creating a table or modifying table attributes. + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. + +Configuration Description +------------------------- + +To enable the HBase MOB function, you need to specify the MOB mode for storing data in the specified column family when creating a table or modifying table attributes. + +Use code to declare that the MOB mode for storing data is used: + +.. code-block:: + + HColumnDescriptor hcd = new HColumnDescriptor("f"); + hcd.setMobEnabled(true); + +Use code to declare that the MOB mode for storing data is used, the unit of MOB_THRESHOLD is byte: + +.. code-block:: + + hbase(main):009:0> create 't3',{NAME => 'd', MOB_THRESHOLD => '102400', IS_MOB => 'true'} + + 0 row(s) in 0.3450 seconds + + => Hbase::Table - t3 + hbase(main):010:0> describe 't3' + Table t3 is ENABLED + + + t3 + + + COLUMN FAMILIES DESCRIPTION + + + {NAME => 'd', MOB_THRESHOLD => '102400', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', + TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', + IN_MEMORY => 'false', IS_MOB => 'true', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'} + + 1 row(s) in 0.0170 seconds + +**Navigation path for setting parameters:** + +On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations** > **All Configurations**. Enter a parameter name in the search box. + +.. _mrs_01_1631__t19912a272c204245856b9698b6f04877: + +.. table:: **Table 1** Parameter description + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=====================================+===================================================================================================================================================================================================================================================================================================================================================+=======================+ + | hbase.mob.file.cache.size | Size of the opened file handle cache. 
If this parameter is set to a large value, more file handles can be cached, reducing the frequency of opening and closing files. However, if this parameter is set to a large value, too many file handles will be opened. The default value is **1000**. This parameter is configured on the ResionServer. | 1000 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.mob.cache.evict.period | Expiration time of cached MOB files in the MOB cache, in seconds. | 3600 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.mob.cache.evict.remain.ratio | Ratio of the number of retained files after MOB cache reclamation to the number of cached files. **hbase.mob.cache.evict.remain.ratio** is an algorithm factor. When the number of cached MOB files reaches the product of **hbase.mob.file.cache.size** **hbase.mob.cache.evict.remain.ratio**, cache reclamation is triggered. | 0.5 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.master.mob.ttl.cleaner.period | Interval for deleting expired files, in seconds. The default value is one day (86,400 seconds). | 86400 | + | | | | + | | .. note:: | | + | | | | + | | If the validity period of an MOB file expires, that is, the file has been created for more than 24 hours, the MOB file will be deleted by the tool for deleting expired MOB files. | | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/creating_hbase_roles.rst b/doc/component-operation-guide/source/using_hbase/creating_hbase_roles.rst new file mode 100644 index 0000000..3272b9a --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/creating_hbase_roles.rst @@ -0,0 +1,84 @@ +:original_name: mrs_01_1608.html + +.. _mrs_01_1608: + +Creating HBase Roles +==================== + +Scenario +-------- + +This section guides the system administrator to create and configure an HBase role on Manager. The HBase role can set HBase administrator permissions and read (R), write (W), create (C), execute (X), or manage (A) permissions for HBase tables and column families. 
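+
+Shown below, for reference only, is how these permission letters appear when table ACLs are inspected from the HBase shell. The user **developer**, the table **t1**, and the column family **d** are hypothetical examples, and in this scenario the permissions themselves are assigned through Manager roles (or Ranger policies) rather than directly in the shell.
+
+.. code-block::
+
+   hbase(main):001:0> user_permission 't1'                # list users and their R/W/X/C/A permissions on table t1
+   hbase(main):002:0> grant 'developer', 'RW', 't1', 'd'  # example of a read/write grant on column family d of table t1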
+ +Users can create a table, query/delete/insert/update data, and authorize others to access HBase tables after they set the corresponding permissions for the specified databases or tables on HDFS. + +.. note:: + + - This section applies to MRS 3.\ *x* or later clusters. + - HBase roles can be created in security mode, but cannot be created in normal mode. + - If the current component uses Ranger for permission control, you need to configure related policies based on Ranger for permission management. For details, see :ref:`Adding a Ranger Access Permission Policy for HBase `. + +Prerequisites +------------- + +- The system administrator has understood the service requirements. + +- You have logged in to Manager. + +Procedure +--------- + +#. On Manager, choose **System** > **Permission** > **Role**. + +#. On the displayed page, click **Create Role** and enter a **Role Name** and **Description**. + +#. Set **Permission**. For details, see :ref:`Table 1 `. + + HBase permissions: + + - HBase Scope: Authorizes HBase tables. The minimum permission is read (R) and write (W) for columns. + - HBase administrator permission: HBase administrator permissions. + + .. note:: + + Users have the read (R), write (W), create (C), execute (X), and administrate (A) permissions for the tables created by themselves. + + .. _mrs_01_1608__t873a9c44357b40cd98cb948ce9438d93: + + .. table:: **Table 1** Setting a role + + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Task | Role Authorization | + +=========================================================================+=========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | Setting the HBase administrator permission | In **Configure Resource Permission**, choose *Name of the desired cluster* > **HBase** and select **HBase Administrator Permission**. 
| + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to create tables | a. In **Configure Resource Permission**, choose *Name of the desired cluster* > **HBase** > **HBase Scope**. | + | | b. Click **global**. | + | | c. In the **Permission** column of the specified namespace, select **Create** and **Execute**. For example, select **Create** and **Execute** for the default namespace **default**. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to write data to tables | a. In **Configure Resource Permission**, choose *Name of the desired cluster* > **HBase** > **HBase Scope** > **global**. | + | | b. In the **Permission** column of the specified namespace, select **Write**. For example, select **Write** for the default namespace **default**. By default, HBase sub-objects inherit the permission from the parent object. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to read data from tables | a. In **Configure Resource Permission**, choose *Name of the desired cluster* > **HBase** > **HBase Scope** > **global**. | + | | b. In the **Permission** column of the specified namespace, select **Read**. For example, select **Read** for the default namespace **default**. By default, HBase sub-objects inherit the permission from the parent object. 
| + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to manage namespaces or tables | a. In **Configure Resource Permission**, choose *Name of the desired cluster* > **HBase** > **HBase Scope** > **global**. | + | | b. In the **Permission** column of the specified namespace, select **Manage**. For example, select **Manage** for the default namespace **default**. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for reading data from or writing data to columns | a. In **Configure Resource Permission**, select *Name of the desired cluster* > **HBase** > **HBase Scope** > **global** and click the specified namespace to display the tables in the namespace. | + | | | + | | b. Click a table. | + | | | + | | c. Click a column family. | + | | | + | | d. Confirm whether you want to create a role? | + | | | + | | - If yes, enter the column name in the **Resource Name** text box. Use commas (,) to separate multiple columns. Select **Read** or **Write**. If there are no columns with the same name in the HBase table, a newly created column with the same name as the existing column has the same permission as the existing one. The column permission is set successfully. | + | | - If no, modify the column permission of the existing HBase role. The columns for which the permission has been separately set are displayed in the table. Go to :ref:`5 `. | + | | | + | | e. .. _mrs_01_1608__lc2f15302f1854175993f36524c25bf26: | + | | | + | | To add column permissions for a role, enter the column name in the **Resource Name** text box and set the column permissions. To modify column permissions for a role, enter the column name in the **Resource Name** text box and set the column permissions. Alternatively, you can directly modify the column permissions in the table. If the column permissions are modified in the table and column permissions with the same name are added, the settings cannot be saved. You are advised to modify the column permission of a role directly in the table. The search function is supported. 
| + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**, and return to the **Role** page. diff --git a/doc/component-operation-guide/source/using_hbase/enabling_cross-cluster_copy.rst b/doc/component-operation-guide/source/using_hbase/enabling_cross-cluster_copy.rst new file mode 100644 index 0000000..4073428 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/enabling_cross-cluster_copy.rst @@ -0,0 +1,68 @@ +:original_name: mrs_01_0502.html + +.. _mrs_01_0502: + +Enabling Cross-Cluster Copy +=========================== + +Scenario +-------- + +DistCp is used to copy the data stored on HDFS from a cluster to another cluster. DistCp depends on the cross-cluster copy function, which is disabled by default. This function needs to be enabled in both clusters. + +This section describes how to enable cross-cluster copy. + +Impact on the System +-------------------- + +Yarn needs to be restarted to enable the cross-cluster copy function and cannot be accessed during the restart. + +Prerequisites +------------- + +The **hadoop.rpc.protection** parameter of the two HDFS clusters must be set to the same data transmission mode, which can be **privacy** (encryption enabled) or **authentication** (encryption disabled). + +.. note:: + + Go to the **All Configurations** page by referring to :ref:`Modifying Cluster Service Configuration Parameters ` and search for **hadoop.rpc.protection**. + + For versions earlier than MRS 3.x, choose **Components** > **HDFS** > **Service Configuration** on the cluster details page. Switch **Basic** to **All**, and search for **hadoop.rpc.protection**. + +Procedure +--------- + +#. Log in to the service page. + + For versions earlier than MRS 1.9.2: Log in to `MRS Manager `__, and choose **Services**. + + For MRS 1.9.2 or later: Click the cluster name on the MRS console and choose **Components**. + +#. Go to the **All Configurations** page of the Yarn service. For details, see :ref:`Modifying Cluster Service Configuration Parameters `. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + +#. In the navigation pane, choose **Yarn** > **Distcp**. + +#. Set **haclusterX.remotenn1** of **dfs.namenode.rpc-address** to the service IP address and RPC port number of one NameNode instance of the peer cluster, and set **haclusterX.remotenn2** to the service IP address and RPC port number of the other NameNode instance of the peer cluster. Enter a value in the *IP address:port* format. + + .. note:: + + For MRS 1.9.2 or later, log in to the MRS console, click the cluster name, and choose **Components** > **HDFS** > **Instances** to obtain the service IP address of the NameNode instance. 
+ + You can also log in to FusionInsight Manager in MRS 3.x clusters, and choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Instance** to obtain the service IP address of the NameNode instance. + + **dfs.namenode.rpc-address.haclusterX.remotenn1** and **dfs.namenode.rpc-address.haclusterX.remotenn2** do not distinguish active and standby NameNode instances. The default NameNode RPC port is 9820 and cannot be modified on MRS Manager. + + For example, **10.1.1.1:9820** and **10.1.1.2:9820**. + + .. note:: + + For MRS 1.6.2 or earlier, the default port number is **25000**. For details, see `List of Open Source Component Ports `__. + +#. Save the configuration. On the **Dashboard** tab page, and choose **More** > **Restart Service** to restart the Yarn service. + + **Operation succeeded** is displayed. Click **Finish**. The Yarn service is started successfully. + +#. Log in to the other cluster and repeat the preceding operations. diff --git a/doc/component-operation-guide/source/using_hbase/geomesa_command_line.rst b/doc/component-operation-guide/source/using_hbase/geomesa_command_line.rst new file mode 100644 index 0000000..4dca0c8 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/geomesa_command_line.rst @@ -0,0 +1,136 @@ +:original_name: mrs_01_24119.html + +.. _mrs_01_24119: + +GeoMesa Command Line +==================== + +.. note:: + + This section applies only to MRS 3.1.0 or later. + +This section describes common GeoMesa commands. For more GeoMesa commands, visit https://www.geomesa.org/documentation/user/accumulo/commandline.html. + +After installing the HBase client and loading environment variables, you can use the geomesa-hbase command line. + +- Viewing **classpath** + + After you run the **classpath** command, all **classpath** information of the current command line tool will be returned. + + **bin/geomesa-hbase classpath** + +- Creating a table + + Run the **create-schema** command to create a table. When creating a table, you need to specify the directory name, table name, and table specifications at least. + + **bin/geomesa-hbase create-schema -c geomesa -f test -s Who:String,What:java.lang.Long,When:Date,*Where:Point:srid=4326,Why:String** + +- Describing a table + + Run the **describe-schema** command to obtain table descriptions. When describing a table, you need to specify the directory name and table name. + + **bin/geomesa-hbase describe-schema -c geomesa -f test** + +- Importing data in batches + + Run the **ingest** command to import data in batches. When importing data, you need to specify the directory name, table name, table specifications, and the related data converter. + + The data in the **data.csv** file contains license plate number, vehicle color, longitude, latitude, and time. Save the data table to the folder. + + .. code-block:: + + AAA,red,113.918417,22.505892,2017-04-09 18:03:46 + BBB,white,113.960719,22.556511,2017-04-24 07:38:47 + CCC,blue,114.088333,22.637222,2017-04-23 15:07:54 + DDD,yellow,114.195456,22.596103,2017-04-21 21:27:06 + EEE,black,113.897614,22.551331,2017-04-09 09:34:48 + + Table structure definition: **myschema.sft**. Save **myschema.sft** to the **conf** folder of the GeoMesa command line tool. + + .. 
code-block:: + + geomesa.sfts.cars = { + attributes = [ + { name = "carid", type = "String", index = true } + { name = "color", type = "String", index = false } + { name = "time", type = "Date", index = false } + { name = "geom", type = "Point", index = true,srid = 4326,default = true } + ] + } + + Converter definition: **myconvertor.convert** Save **myconvertor.convert** to the **conf** folder of the GeoMesa command line tool. + + .. code-block:: + + geomesa.converters.cars= { + type = "delimited-text", + format = "CSV", + id-field = "$fid", + fields = [ + { name = "fid", transform = "concat($1,$5)" } + { name = "carid", transform = "$1::string" } + { name = "color", transform = "$2::string" } + { name = "lon", transform = "$3::double" } + { name = "lat", transform = "$4::double" } + { name = "geom", transform = "point($lon,$lat)" } + { name = "time", transform = "date('YYYY-MM-dd HH:mm:ss',$5)" } + ] + } + + Run the following command to import data: + + **bin/geomesa-hbase ingest -c geomesa -C conf/myconvertor.convert -s conf/myschema.sft data/data.csv** + + For details about other parameters for importing data, visit https://www.geomesa.org/documentation/user/accumulo/examples.html#ingesting-data. + +- Querying explanations + + Run the **explain** command to obtain execution plan explanations of the specified query statement. You need to specify the directory name, table name, and query statement. + + **bin/geomesa-hbase explain -c geomesa -f cars -q "carid = 'BBB'"** + +- Analyzing statistics + + Run the **stats-analyze** command to conduct statistical analysis on the data table. In addition, you can run the **stats-bounds**, **stats-count**, **stats-histogram**, and **stats-top-k** commands to collect more detailed statistics on the data table. + + **bin/geomesa-hbase stats-analyze -c geomesa -f cars** + + **bin/geomesa-hbase stats-bounds -c geomesa -f cars** + + **bin/geomesa-hbase stats-count -c geomesa -f cars** + + **bin/geomesa-hbase stats-histogram -c geomesa -f cars** + + **bin/geomesa-hbase stats-top-k -c geomesa -f cars** + +- Exporting a feature + + Run the **export** command to export a feature. When exporting the feature, you must specify the directory name and table name. In addition, you can specify a query statement to export the feature. + + **bin/geomesa-hbase export -c geomesa -f cars -q "carid = 'BBB'"** + +- Deleting a feature + + Run the **delete-features** command to delete a feature. When deleting the feature, you must specify the directory name and table name. In addition, you can specify a query statement to delete the feature. + + **bin/geomesa-hbase delete-features -c geomesa -f cars -q "carid = 'BBB'"** + +- Obtain the names of all tables in the directory. + + Run the **get-type-names** command to obtain the names of tables in the specified directory. + + **bin/geomesa-hbase get-type-names -c geomesa** + +- Deleting a table + + Run the **remove-schema** command to delete a table. You need to specify the directory name and table name at least. + + **bin/geomesa-hbase remove-schema -c geomesa -f test** + + **bin/geomesa-hbase remove-schema -c geomesa -f cars** + +- Deleting a catalog + + Run the **delete-catalog** command to delete the specified catalog. 
+ + **bin/geomesa-hbase delete-catalog -c geomesa** diff --git a/doc/component-operation-guide/source/using_hbase/hbase_log_overview.rst b/doc/component-operation-guide/source/using_hbase/hbase_log_overview.rst new file mode 100644 index 0000000..611b3e9 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_log_overview.rst @@ -0,0 +1,98 @@ +:original_name: mrs_01_1056.html + +.. _mrs_01_1056: + +HBase Log Overview +================== + +Log Description +--------------- + +**Log path**: The default storage path of HBase logs is **/var/log/Bigdata/hbase/**\ *Role name*. + +- HMaster: **/var/log/Bigdata/hbase/hm** (run logs) and **/var/log/Bigdata/audit/hbase/hm** (audit logs) +- RegionServer: **/var/log/Bigdata/hbase/rs** (run logs) and **/var/log/Bigdata/audit/hbase/rs** (audit logs) +- ThriftServer: **/var/log/Bigdata/hbase/ts2** (run logs, **ts2** is the instance name) and **/var/log/Bigdata/audit/hbase/ts2** (audit logs, **ts2** is the instance name) + +**Log archive rule**: The automatic log compression and archiving function of HBase is enabled. By default, when the size of a log file exceeds 30 MB, the log file is automatically compressed. The naming rule of a compressed log file is as follows: <*Original log name*>-<*yyyy-mm-dd_hh-mm-ss*>.[*ID*].\ **log.zip** A maximum of 20 latest compressed files are reserved. The number of compressed files can be configured on the Manager portal. + +.. table:: **Table 1** HBase log list + + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Type | Name | Description | + +============+================================================+===============================================================================================================================+ + | Run logs | hbase---.log | HBase system log that records the startup time, startup parameters, and most logs generated when the HBase system is running. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | hbase---.out | Log that records the HBase running environment information. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | ----gc.log | Log that records HBase junk collections. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | checkServiceDetail.log | Log that records whether the HBase service starts successfully. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | hbase.log | Log generated when the HBase service health check script and some alarm check scripts are executed. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | sendAlarm.log | Log that records alarms reported after execution of HBase alarm check scripts. 
| + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | hbase-haCheck.log | Log that records the active and standby status of HMaster | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | | stop.log | Log that records the startup and stop processes of HBase. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + | Audit logs | hbase-audit-.log | Log that records HBase security audit. | + +------------+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ + +Log Level +--------- + +:ref:`Table 2 ` describes the log levels supported by HBase. The priorities of log levels are FATAL, ERROR, WARN, INFO, and DEBUG in descending order. Logs whose levels are higher than or equal to the specified level are printed. The number of printed logs decreases as the specified log level increases. + +.. _mrs_01_1056__tbb25f33f364f4d2d8e14cd48d9f8dd0b: + +.. table:: **Table 2** Log levels + + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | Level | Description | + +=======+==========================================================================================================================================+ + | FATAL | Logs of this level record fatal error information about the current event processing that may result in a system crash. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | ERROR | Logs of this level record error information about the current event processing, which indicates that system running is abnormal. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | WARN | Logs of this level record abnormal information about the current event processing. These abnormalities will not result in system faults. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | INFO | Logs of this level record normal running status information about the system and events. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + | DEBUG | Logs of this level record the system information and system debugging information. | + +-------+------------------------------------------------------------------------------------------------------------------------------------------+ + +To modify log levels, perform the following operations: + +#. Go to the **All Configurations** page of the HBase service. For details, see :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the left menu bar, select the log menu of the target role. +#. Select a desired log level. +#. Save the configuration. 
In the displayed dialog box, click **OK** to make the configurations take effect. + + .. note:: + + The configurations take effect immediately without the need to restart the service. + +Log Formats +----------- + +The following table lists the HBase log formats. + +.. table:: **Table 3** Log formats + + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Component | Format | Example | + +============+==============+==============================================================================================================================+======================================================================================================================================================================================================================+ + | Run logs | HMaster | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-01-19 16:04:53,558 \| INFO \| main \| env:HBASE_THRIFT_OPTS= \| org.apache.hadoop.hbase.util.ServerCommandLine.logProcessInfo(ServerCommandLine.java:113) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | RegionServer | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-01-19 16:05:18,589 \| INFO \| regionserver16020-SendThread(linux-k6da:2181) \| Client will use GSSAPI as SASL mechanism. \| org.apache.zookeeper.client.ZooKeeperSaslClient$1.run(ZooKeeperSaslClient.java:285) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ThriftServer | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-02-16 09:42:55,371 \| INFO \| main \| loaded properties from hadoop-metrics2.properties \| org.apache.hadoop.metrics2.impl.MetricsConfig.loadFirst(MetricsConfig.java:111) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit logs | HMaster | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-02-16 09:42:40,934 \| INFO \| master:linux-k6da:16000 \| Master: [master:linux-k6da:16000] start operation called. 
\| org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:581) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | RegionServer | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-02-16 09:42:51,063 \| INFO \| main \| RegionServer: [regionserver16020] start operation called. \| org.apache.hadoop.hbase.regionserver.HRegionServer.startRegionServer(HRegionServer.java:2396) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ThriftServer | ||<*Thread that generates the log*>|<*Message in the log*>|<*Location of the log event*> | 2020-02-16 09:42:55,512 \| INFO \| main \| thrift2 server start operation called. \| org.apache.hadoop.hbase.thrift2.ThriftServer.main(ThriftServer.java:421) | + +------------+--------------+------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_put_performance.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_put_performance.rst new file mode 100644 index 0000000..8dbd9d5 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_put_performance.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_1637.html + +.. _mrs_01_1637: + +Improving Put Performance +========================= + +Scenario +-------- + +In the scenario where a large number of requests are continuously put, setting the following two parameters to **false** can greatly improve the Put performance. + +- **hbase.regionserver.wal.durable.sync** + +- **hbase.regionserver.hfile.durable.sync** + +When the performance is improved, there is a low probability that data is lost if three DataNodes are faulty at the same time. Exercise caution when configuring the parameters in scenarios that have high requirements on data reliability. + +.. note:: + + This section applies to MRS 3.\ *x* and later versions. + +Procedure +--------- + +Navigation path for setting parameters: + +On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HBase** > **Configurations** > **All Configurations**. Enter the parameter name in the search box, and change the value. + +.. 
table:: **Table 1** Parameters for improving put performance + + +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ + | Parameter | Description | Value | + +===================+=====================================================================================================================================================================================================================================+=======+ + | hbase.wal.hsync | Specifies whether to enable WAL file durability to make the WAL data persistence on disks. If this parameter is set to **true**, the performance is affected because each WAL file is synchronized to the disk by the Hadoop fsync. | false | + +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ + | hbase.hfile.hsync | Specifies whether to enable the HFile durability to make data persistence on disks. If this parameter is set to true, the performance is affected because each Hfile file is synchronized to the disk by the Hadoop fsync. | false | + +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_read_performance.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_read_performance.rst new file mode 100644 index 0000000..fc7db45 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_read_performance.rst @@ -0,0 +1,89 @@ +:original_name: mrs_01_1018.html + +.. _mrs_01_1018: + +Improving Real-time Data Read Performance +========================================= + +Scenario +-------- + +HBase data needs to be read. + +Prerequisites +------------- + +The get or scan interface of HBase has been invoked and data is read in real time from HBase. + +Procedure +--------- + +- **Data reading server tuning** + + Parameter portal: + + Go to the **All Configurations** page of the HBase service. For details, see :ref:`Modifying Cluster Service Configuration Parameters `. + + .. 
table:: **Table 1** Configuration items that affect real-time data reading + + +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +==================================+=================================================================================================================================================================================================================================================================================================================================================================+=======================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | GC_OPTS | You can increase HBase memory to improve HBase performance because read and write operations are performed in HBase memory. | For versions earlier than MRS 3.x: | + | | | | + | | **HeapSize** and **NewSize** need to be adjusted. When you adjust **HeapSize**, set **Xms** and **Xmx** to the same value to avoid performance problems when JVM dynamically adjusts **HeapSize**. Set **NewSize** to 1/8 of **HeapSize**. | - HMaster: | + | | | | + | | - **HMaster**: If HBase clusters enlarge and the number of Regions grows, properly increase the **GC_OPTS** parameter value of the HMaster. | -server -Xms2G -Xmx2G -XX:NewSize=256M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:MaxDirectMemorySize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + | | - **RegionServer**: A RegionServer needs more memory than an HMaster. If sufficient memory is available, increase the **HeapSize** value. | | + | | | - RegionServer: | + | | .. 
note:: | | + | | | -server -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=512M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:MaxDirectMemorySize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + | | When the value of **HeapSize** for the active HMaster is 4 GB, the HBase cluster can support 100,000 regions. Empirically, each time 35,000 regions are added to the cluster, the value of **HeapSize** must be increased by 2 GB. It is recommended that the value of **HeapSize** for the active HMaster not exceed 32 GB. | | + | | | For MRS 3.\ *x* or later: | + | | | | + | | | - HMaster | + | | | | + | | | -server -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=512M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + | | | | + | | | - Region Server | + | | | | + | | | -server -Xms6G -Xmx6G -XX:NewSize=1024M -XX:MaxNewSize=1024M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.regionserver.handler.count | Indicates the number of requests that RegionServer can process concurrently. If the parameter is set to an excessively large value, threads will compete fiercely. If the parameter is set to an excessively small value, requests will be waiting for a long time in RegionServer, reducing the processing capability. You can add threads based on resources. | 200 | + | | | | + | | It is recommended that the value be set to 100 to 300 based on the CPU usage. 
| | + +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hfile.block.cache.size | HBase cache sizes affect query efficiency. Set cache sizes based on query modes and query record distribution. If random query is used to reduce the hit ratio of the buffer, you can reduce the buffer size. | When **offheap** is disabled, the default value is **0.25**. When **offheap** is enabled, the default value is **0.1**. | + +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. note:: + + If read and write operations are performed at the same time, the performance of the two operations affects each other. If flush and compaction operations are frequently performed due to data writes, a large number of disk I/O operations are occupied, affecting read performance. If a large number of compaction operations are blocked due to write operations, multiple HFiles exist in the region, affecting read performance. Therefore, if the read performance is unsatisfactory, you need to check whether the write configurations are proper. + +- **Data reading client tuning** + + When scanning data, you need to set **caching** (the number of records read from the server at a time. The default value is **1**.). If the default value is used, the read performance will be extremely low. + + If you do not need to read all columns of a piece of data, specify the columns to be read to reduce network I/O. + + If you only need to read the row key, add a filter (FirstKeyOnlyFilter or KeyOnlyFilter) that only reads the row key. + +- **Data table reading design optimization** + + .. 
table:: **Table 2** Parameters affecting real-time data reading + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=======================+==========================================================================================================================================================================================================================================================================+=======================+ + | COMPRESSION | The compression algorithm compresses blocks in HFiles. For compressible data, configure the compression algorithm to efficiently reduce disk I/Os and improve performance. | NONE | + | | | | + | | .. note:: | | + | | | | + | | Some data cannot be efficiently compressed. For example, a compressed figure can hardly be compressed again. The common compression algorithm is SNAPPY, because it has a high encoding/decoding speed and acceptable compression rate. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | BLOCKSIZE | Different block sizes affect HBase data read and write performance. You can configure sizes for blocks in an HFile. Larger blocks have a higher compression rate. However, they have poor performance in random data read, because HBase reads data in a unit of blocks. | 65536 | + | | | | + | | Set the parameter to 128 KB or 256 KB to improve data write efficiency without greatly affecting random read performance. The unit is byte. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | DATA_BLOCK_ENCODING | Encoding method of the block in an HFile. If a row contains multiple columns, set **FAST_DIFF** to save data storage space and improve performance. | NONE | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_write_performance.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_write_performance.rst new file mode 100644 index 0000000..a3d4bd2 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_real-time_data_write_performance.rst @@ -0,0 +1,123 @@ +:original_name: mrs_01_1017.html + +.. 
_mrs_01_1017: + +Improving Real-time Data Write Performance +========================================== + +Scenario +-------- + +Scenarios where data needs to be written to HBase in real time, or large-scale and consecutive put scenarios + +.. note:: + + This section applies to MRS 3.\ *x* and later versions. + +Prerequisites +------------- + +The HBase put or delete interface can be used to save data to HBase. + +Procedure +--------- + +- **Data writing server tuning** + + Parameter portal: + + Go to the **All Configurations** page of the HBase service. For details, see :ref:`Modifying Cluster Service Configuration Parameters `. + + .. table:: **Table 1** Configuration items that affect real-time data writing + + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +===============================================+================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | hbase.wal.hsync | Controls the synchronization degree when HLogs are written to the HDFS. If the value is **true**, HDFS returns only when data is written to the disk. If the value is **false**, HDFS returns when data is written to the OS cache. | true | + | | | | + | | Set the parameter to **false** to improve write performance. 
| | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hfile.hsync | Controls the synchronization degree when HFiles are written to the HDFS. If the value is **true**, HDFS returns only when data is written to the disk. If the value is **false**, HDFS returns when data is written to the OS cache. | true | + | | | | + | | Set the parameter to **false** to improve write performance. | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | GC_OPTS | You can increase HBase memory to improve HBase performance because read and write operations are performed in HBase memory. **HeapSize** and **NewSize** need to be adjusted. When you adjust **HeapSize**, set **Xms** and **Xmx** to the same value to avoid performance problems when JVM dynamically adjusts **HeapSize**. Set **NewSize** to 1/8 of **HeapSize**. | - HMaster | + | | | | + | | - **HMaster**: If HBase clusters enlarge and the number of Regions grows, properly increase the **GC_OPTS** parameter value of the HMaster. | -server -Xms4G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=512M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + | | - **RegionServer**: A RegionServer needs more memory than an HMaster. 
If sufficient memory is available, increase the **HeapSize** value. | | + | | | - Region Server | + | | .. note:: | | + | | | -server -Xms6G -Xmx6G -XX:NewSize=1024M -XX:MaxNewSize=1024M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M | + | | When the value of **HeapSize** for the active HMaster is 4 GB, the HBase cluster can support 100,000 regions. Empirically, each time 35,000 regions are added to the cluster, the value of **HeapSize** must be increased by 2 GB. It is recommended that the value of **HeapSize** for the active HMaster not exceed 32 GB. | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.regionserver.handler.count | Indicates the number of RPC server instances started on RegionServer. If the parameter is set to an excessively large value, threads will compete fiercely. If the parameter is set to an excessively small value, requests will be waiting for a long time in RegionServer, reducing the processing capability. You can add threads based on resources. | 200 | + | | | | + | | It is recommended that the value be set to **100** to **300** based on the CPU usage. 
| | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hregion.max.filesize | Indicates the maximum size of an HStoreFile, in bytes. If the size of any HStoreFile exceeds the value of this parameter, the managed Hregion is divided into two parts. | 10737418240 | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hregion.memstore.flush.size | On the RegionServer, when the size of memstore that exists in memory of write operations exceeds **memstore.flush.size**, MemStoreFlusher performs the Flush operation to write the memstore to the corresponding store in the format of HFile. | 134217728 | + | | | | + | | If RegionServer memory is sufficient and active Regions are few, increase the parameter value and reduce compaction times to improve system performance. | | + | | | | + | | The Flush operation may be delayed after it takes place. Write operations continue and memstore keeps increasing during the delay. The maximum size of memstore is: **memstore.flush.size** x **hbase.hregion.memstore.block.multiplier**. When the memstore size exceeds the maximum value, write operations are blocked. Properly increasing the value of **hbase.hregion.memstore.block.multiplier** can reduce the blocks and make performance become more stable. 
Unit: byte | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.regionserver.global.memstore.size | Updates the size of all MemStores supported by the RegionServer before locking or forcible flush. On the RegionServer, the MemStoreFlusher thread performs the flush. The thread regularly checks memory occupied by write operations. When the total memory volume occupied by write operations exceeds the threshold, MemStoreFlusher performs the flush. Larger memstore will be flushed first and then smaller ones until the occupied memory is less than the threshold. | 0.4 | + | | | | + | | Threshold = hbase.regionserver.global.memstore.size x hbase.regionserver.global.memstore.size.lower.limit x HBase_HEAPSIZE | | + | | | | + | | .. note:: | | + | | | | + | | The sum of the parameter value and the value of **hfile.block.cache.size** cannot exceed 0.8, that is, memory occupied by read and write operations cannot exceed 80% of **HeapSize**, ensuring stable running of other operations. | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hstore.blockingStoreFiles | Check whether the number of files is larger than the value of **hbase.hstore.blockingStoreFiles** before you flush regions. | 15 | + | | | | + | | If it is larger than the value of **hbase.hstore.blockingStoreFiles**, perform a compaction and configure **hbase.hstore.blockingWaitTime** to 90s to make the flush delay for 90s. 
During the delay, write operations continue and the memstore size keeps increasing and exceeds the threshold (**memstore.flush.size** x **hbase.hregion.memstore.block.multiplier**), blocking write operations. After compaction is complete, a large number of writes may be generated. As a result, the performance fluctuates sharply. | | + | | | | + | | Increase the value of **hbase.hstore.blockingStoreFiles** to reduce block possibilities. | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.regionserver.thread.compaction.throttle | The compression whose size is greater than the value of this parameter is executed by the large thread pool. The unit is bytes. Indicates a threshold of a total file size for compaction during a Minor Compaction. The total file size affects execution duration of a compaction. If the total file size is large, other compactions or flushes may be blocked. | 1610612736 | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hstore.compaction.min | Indicates the minimum number of HStoreFiles on which minor compaction is performed each time. When the size of a file in a Store exceeds the value of this parameter, the file is compacted. You can increase the value of this parameter to reduce the number of times that the file is compacted. If there are too many files in the Store, read performance will be affected. 
| 6 | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hstore.compaction.max | Indicates the maximum number of HStoreFiles on which minor compaction is performed each time. The functions of the parameter and **hbase.hstore.compaction.max.size** are similar. Both are used to limit the execution duration of one compaction. | 10 | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hstore.compaction.max.size | If the size of an HFile is larger than the parameter value, the HFile will not be compacted in a Minor Compaction but can be compacted in a Major Compaction. | 9223372036854775807 | + | | | | + | | The parameter is used to prevent HFiles of large sizes from being compacted. After a Major Compaction is forbidden, multiple HFiles can exist in a Store and will not be merged into one HFile, without affecting data access performance. The unit is byte. 
| | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | hbase.hregion.majorcompaction | Main compression interval of all HStoreFile files in a region. The unit is milliseconds. Execution of Major Compactions consumes much system resources and will affect system performance during peak hours. | 604800000 | + | | | | + | | If service updates, deletion, and reclamation of expired data space are infrequent, set the parameter to **0** to disable Major Compactions. | | + | | | | + | | If you must perform a Major Compaction to reclaim more space, increase the parameter value and configure the **hbase.offpeak.end.hour** and **hbase.offpeak.start.hour** parameters to make the Major Compaction be triggered in off-peak hours. | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | - hbase.regionserver.maxlogs | - Indicates the threshold for the number of HLog files that are not flushed on a RegionServer. If the number of HLog files is greater than the threshold, the RegionServer forcibly performs flush operations. | - 32 | + | - hbase.regionserver.hlog.blocksize | - Indicates the maximum size of an HLog file. If the size of an HLog file is greater than the value of this parameter, a new HLog file is generated. The old HLog file is disabled and archived. | - 134217728 | + | | | | + | | The two parameters determine the number of HLogs that are not flushed in a RegionServer. When the data volume is less than the total size of memstore, the flush operation is forcibly triggered due to excessive HLog files. 
In this case, you can adjust the values of the two parameters to avoid forcible flush. Unit: byte | | + +-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +- **Data writing client tuning** + + It is recommended that data is written in Put List mode if necessary, which greatly improves write performance. The length of each put list needs to be set based on the single put size and parameters of the actual environment. You are advised to do some basic tests before configuring parameters. + +- **Data table writing design optimization** + + .. table:: **Table 2** Parameters affecting real-time data writing + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=======================+==========================================================================================================================================================================================================================================================================+=======================+ + | COMPRESSION | The compression algorithm compresses blocks in HFiles. For compressible data, configure the compression algorithm to efficiently reduce disk I/Os and improve performance. | NONE | + | | | | + | | .. note:: | | + | | | | + | | Some data cannot be efficiently compressed. For example, a compressed figure can hardly be compressed again. The common compression algorithm is SNAPPY, because it has a high encoding/decoding speed and acceptable compression rate. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | BLOCKSIZE | Different block sizes affect HBase data read and write performance. You can configure sizes for blocks in an HFile. Larger blocks have a higher compression rate. However, they have poor performance in random data read, because HBase reads data in a unit of blocks. | 65536 | + | | | | + | | Set the parameter to 128 KB or 256 KB to improve data write efficiency without greatly affecting random read performance. The unit is byte. 
| | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | IN_MEMORY | Whether to cache table data in the memory first, which improves data read performance. If you will frequently access some small tables, set the parameter. | false | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_the_bulkload_efficiency.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_the_bulkload_efficiency.rst new file mode 100644 index 0000000..ed89563 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/improving_the_bulkload_efficiency.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_1636.html + +.. _mrs_01_1636: + +Improving the BulkLoad Efficiency +================================= + +Scenario +-------- + +BulkLoad uses MapReduce jobs to directly generate files that comply with the internal data format of HBase, and then loads the generated StoreFiles to a running cluster. Compared with HBase APIs, BulkLoad saves more CPU and network resources. + +ImportTSV is an HBase table data loading tool. + +.. note:: + + This section applies to MRS 3.\ *x* and later versions. + +Prerequisites +------------- + +When using BulkLoad, the output path of the file has been specified using the **Dimporttsv.bulk.output** parameter. + +Procedure +--------- + +Add the following parameter to the BulkLoad command when performing a batch loading task: + +.. table:: **Table 1** Parameter for improving BulkLoad efficiency + + +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------+ + | Parameter | Description | Value | + +==========================+=================================================================================================================================================================================================================================================================================================================================+=========================================================+ + | -Dimporttsv.mapper.class | The construction of key-value pairs is moved from the user-defined mapper to reducer to improve performance. The mapper only needs to send the original text in each row to the reducer. The reducer parses the record in each row and creates a key-value) pair. | org.apache.hadoop.hbase.mapreduce.TsvImporterByteMapper | + | | | | + | | .. 
note:: | and | + | | | | + | | When this parameter is set to **org.apache.hadoop.hbase.mapreduce.TsvImporterByteMapper**, this parameter is used only when the batch loading command without the *HBASE_CELL_VISIBILITY OR HBASE_CELL_TTL* option is executed. The **org.apache.hadoop.hbase.mapreduce.TsvImporterByteMapper** provides better performance. | org.apache.hadoop.hbase.mapreduce.TsvImporterTextMapper | + +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/index.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/index.rst new file mode 100644 index 0000000..d8dad88 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/index.rst @@ -0,0 +1,24 @@ +:original_name: mrs_01_1013.html + +.. _mrs_01_1013: + +HBase Performance Tuning +======================== + +- :ref:`Improving the BulkLoad Efficiency ` +- :ref:`Improving Put Performance ` +- :ref:`Optimizing Put and Scan Performance ` +- :ref:`Improving Real-time Data Write Performance ` +- :ref:`Improving Real-time Data Read Performance ` +- :ref:`Optimizing JVM Parameters ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + improving_the_bulkload_efficiency + improving_put_performance + optimizing_put_and_scan_performance + improving_real-time_data_write_performance + improving_real-time_data_read_performance + optimizing_jvm_parameters diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_jvm_parameters.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_jvm_parameters.rst new file mode 100644 index 0000000..cf36410 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_jvm_parameters.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_1019.html + +.. _mrs_01_1019: + +Optimizing JVM Parameters +========================= + +Scenario +-------- + +When the number of clusters reaches a certain scale, the default settings of the Java virtual machine (JVM) cannot meet the cluster requirements. In this case, the cluster performance deteriorates or the clusters may be unavailable. Therefore, JVM parameters must be properly configured based on actual service conditions to improve the cluster performance. + +Procedure +--------- + +**Navigation path for setting parameters:** + +The JVM parameters related to the HBase role must be configured in the **hbase-env.sh** file in the **${BIGDATA_HOME}/FusionInsight_HD_*/install/FusionInsight-HBase-2.2.3/hbase/conf/** directory of the node where the HBase service is installed. + +Each role has JVM parameter configuration variables, as shown in :ref:`Table 1 `. + +.. _mrs_01_1019__t2451c7af790c44cc8f895f6d4dc68b55: + +.. 
table:: **Table 1** HBase-related JVM parameter configuration variables + + +-------------------------+----------------------------------------------------------------+ + | Variable | Affected Role | + +=========================+================================================================+ + | HBASE_OPTS | All roles of HBase | + +-------------------------+----------------------------------------------------------------+ + | SERVER_GC_OPTS | All roles on the HBase server, such as Master and RegionServer | + +-------------------------+----------------------------------------------------------------+ + | CLIENT_GC_OPTS | Client process of HBase | + +-------------------------+----------------------------------------------------------------+ + | HBASE_MASTER_OPTS | Master of HBase | + +-------------------------+----------------------------------------------------------------+ + | HBASE_REGIONSERVER_OPTS | RegionServer of HBase | + +-------------------------+----------------------------------------------------------------+ + | HBASE_THRIFT_OPTS | Thrift of HBase | + +-------------------------+----------------------------------------------------------------+ + +**Configuration example:** + +.. code-block:: + + export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS" diff --git a/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_put_and_scan_performance.rst b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_put_and_scan_performance.rst new file mode 100644 index 0000000..a5c6378 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/hbase_performance_tuning/optimizing_put_and_scan_performance.rst @@ -0,0 +1,91 @@ +:original_name: mrs_01_1016.html + +.. _mrs_01_1016: + +Optimizing Put and Scan Performance +=================================== + +Scenario +-------- + +HBase has many configuration parameters related to read and write performance. The configuration parameters need to be adjusted based on the read/write request loads. This section describes how to optimize read and write performance by modifying the RegionServer configurations. + +.. note:: + + This section applies to MRS 3.\ *x* and later versions. + +Procedure +--------- + +- JVM GC parameters + + Suggestions on setting the RegionServer **GC_OPTS** parameter: + + - Set **-Xms** and **-Xmx** to the same value based on your needs. Increasing the memory can improve the read and write performance. For details, see the description of **hfile.block.cache.size** in :ref:`Table 2 ` and **hbase.regionserver.global.memstore.size** in :ref:`Table 1 `. + - Set **-XX:NewSize** and **-XX:MaxNewSize** to the same value. You are advised to set the value to **512M** in low-load scenarios and **2048M** in high-load scenarios. + - Set **X-XX:CMSInitiatingOccupancyFraction** to be less than and equal to 90, and it is calculated as follows: **100 x (hfile.block.cache.size + hbase.regionserver.global.memstore.size + 0.05)**. + - **-XX:MaxDirectMemorySize** indicates the non-heap memory used by the JVM. You are advised to set this parameter to **512M** in low-load scenarios and **2048M** in high-load scenarios. + + .. note:: + + The **-XX:MaxDirectMemorySize** parameter is not used by default. If you need to set this parameter, add it to the **GC_OPTS** parameter. 
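+
+  For reference, the preceding suggestions can be combined into one **GC_OPTS** value for RegionServer. The following is only an illustrative sketch: it assumes a high-load RegionServer with an 8 GB heap and the default values of **hfile.block.cache.size** (0.25) and **hbase.regionserver.global.memstore.size** (0.4), which give 100 x (0.25 + 0.4 + 0.05) = 70 for **-XX:CMSInitiatingOccupancyFraction**. Adjust the heap and new generation sizes to the actual service load before using such a value.
+
+  .. code-block::
+
+     # Illustrative only: the 8 GB heap and 2 GB new generation are assumed values, not recommendations
+     -server -Xms8G -Xmx8G -XX:NewSize=2048M -XX:MaxNewSize=2048M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxDirectMemorySize=2048M
+
+  If you change **hfile.block.cache.size** or **hbase.regionserver.global.memstore.size**, recalculate **-XX:CMSInitiatingOccupancyFraction** using the formula above.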
+ +- Put parameters + + RegionServer processes the data of the put request and writes the data to memstore and HLog. + + - When the size of memstore reaches the value of **hbase.hregion.memstore.flush.size**, memstore is updated to HDFS to generate HFiles. + - Compaction is triggered when the number of HFiles in the column cluster of the current region reaches the value of **hbase.hstore.compaction.min**. + - If the number of HFiles in the column cluster of the current region reaches the value of **hbase.hstore.blockingStoreFiles**, the operation of refreshing the memstore and generating HFiles is blocked. As a result, the put request is blocked. + + .. _mrs_01_1016__t5194159aa9d34637bba4cdd0aa3b925e: + + .. table:: **Table 1** Put parameters + + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +============================================+====================================================================================================================================================================================================================================================================================================================================================================================================+=======================+ + | hbase.wal.hsync | Indicates whether each WAL is persistent to disks. | true | + | | | | + | | For details, see :ref:`Improving Put Performance `. | | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.hfile.hsync | Indicates whether HFile write operations are persistent to disks. | true | + | | | | + | | For details, see :ref:`Improving Put Performance `. | | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.hregion.memstore.flush.size | If the size of MemStore (unit: Byte) exceeds a specified value, MemStore is flushed to the corresponding disk. The value of this parameter is checked by each thread running **hbase.server.thread.wakefrequency**. It is recommended that you set this parameter to an integer multiple of the HDFS block size. You can increase the value if the memory is sufficient and the put load is heavy. 
| 134217728 | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.regionserver.global.memstore.size | Updates the size of all MemStores supported by the RegionServer before locking or forcible flush. It is recommended that you set this parameter to **hbase.hregion.memstore.flush.size x Number of regions with active writes/RegionServer GC -Xmx**. The default value is **0.4**, indicating that 40% of RegionServer GC -Xmx is used. | 0.4 | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.hstore.flusher.count | Indicates the number of memstore flush threads. You can increase the parameter value in heavy-put-load scenarios. | 2 | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.regionserver.thread.compaction.small | Indicates the number of small compaction threads. You can increase the parameter value in heavy-put-load scenarios. | 10 | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | hbase.hstore.blockingStoreFiles | If the number of HStoreFile files in a Store exceeds the specified value, the update of the HRegion will be locked until a compression is completed or the value of **base.hstore.blockingWaitTime** is exceeded. Each time MemStore is flushed, a StoreFile file is written into MemStore. Set this parameter to a larger value in heavy-put-load scenarios. | 15 | + +--------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +- Scan parameters + + .. _mrs_01_1016__tcd04a4cfd9f94a80a47de3ccb824175e: + + .. 
table:: **Table 2** Scan parameters + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +=====================================+===============================================================================================================================================================================================================================================================================================================+=================================================================================================================+ + | hbase.client.scanner.timeout.period | Client and RegionServer parameters, indicating the lease timeout period of the client executing the scan operation. You are advised to set this parameter to an integer multiple of 60000 ms. You can set this parameter to a larger value when the read load is heavy. The unit is milliseconds. | 60000 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ + | hfile.block.cache.size | Indicates the data cache percentage in the RegionServer GC -Xmx. You can increase the parameter value in heavy-read-load scenarios, in order to improve cache hit ratio and performance. It indicates the percentage of the maximum heap (-Xmx setting) allocated to the block cache of HFiles or StoreFiles. | When offheap is disabled, the default value is **0.25**. When offheap is enabled, the default value is **0.1**. | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ + +- Handler parameters + + .. table:: **Table 3** Handler parameters + + +--------------------------------------+------------------------------------------------------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +======================================+==============================================================================================================================+===============+ + | hbase.regionserver.handler.count | Indicates the number of RPC server instances on RegionServer. The recommended value ranges from 200 to 400. 
| 200 | + +--------------------------------------+------------------------------------------------------------------------------------------------------------------------------+---------------+ + | hbase.regionserver.metahandler.count | Indicates the number of program instances for processing prioritized requests. The recommended value ranges from 200 to 400. | 200 | + +--------------------------------------+------------------------------------------------------------------------------------------------------------------------------+---------------+ diff --git a/doc/component-operation-guide/source/using_hbase/index.rst b/doc/component-operation-guide/source/using_hbase/index.rst new file mode 100644 index 0000000..1444dbb --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/index.rst @@ -0,0 +1,50 @@ +:original_name: mrs_01_0500.html + +.. _mrs_01_0500: + +Using HBase +=========== + +- :ref:`Using HBase from Scratch ` +- :ref:`Using an HBase Client ` +- :ref:`Creating HBase Roles ` +- :ref:`Configuring HBase Replication ` +- :ref:`Configuring HBase Parameters ` +- :ref:`Enabling Cross-Cluster Copy ` +- :ref:`Using the ReplicationSyncUp Tool ` +- :ref:`GeoMesa Command Line ` +- :ref:`Configuring HBase DR ` +- :ref:`Configuring HBase Data Compression and Encoding ` +- :ref:`Performing an HBase DR Service Switchover ` +- :ref:`Performing an HBase DR Active/Standby Cluster Switchover ` +- :ref:`Community BulkLoad Tool ` +- :ref:`Configuring the MOB ` +- :ref:`Configuring Secure HBase Replication ` +- :ref:`Configuring Region In Transition Recovery Chore Service ` +- :ref:`HBase Log Overview ` +- :ref:`HBase Performance Tuning ` +- :ref:`Common Issues About HBase ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + using_hbase_from_scratch + using_an_hbase_client + creating_hbase_roles + configuring_hbase_replication + configuring_hbase_parameters + enabling_cross-cluster_copy + using_the_replicationsyncup_tool + geomesa_command_line + configuring_hbase_dr + configuring_hbase_data_compression_and_encoding + performing_an_hbase_dr_service_switchover + performing_an_hbase_dr_active_standby_cluster_switchover + community_bulkload_tool + configuring_the_mob + configuring_secure_hbase_replication + configuring_region_in_transition_recovery_chore_service + hbase_log_overview + hbase_performance_tuning/index + common_issues_about_hbase/index diff --git a/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_active_standby_cluster_switchover.rst b/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_active_standby_cluster_switchover.rst new file mode 100644 index 0000000..50ec9ed --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_active_standby_cluster_switchover.rst @@ -0,0 +1,82 @@ +:original_name: mrs_01_1611.html + +.. _mrs_01_1611: + +Performing an HBase DR Active/Standby Cluster Switchover +======================================================== + +Scenario +-------- + +The HBase cluster in the current environment is a DR cluster. Due to some reasons, the active and standby clusters need to be switched over. That is, the standby cluster becomes the active cluster, and the active cluster becomes the standby cluster. + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. 
+ +Impact on the System +-------------------- + +After the active and standby clusters are switched over, data cannot be written to the original active cluster, and the original standby cluster becomes the active cluster to take over upper-layer services. + +Procedure +--------- + +**Ensuring that upper-layer services are stopped** + +#. Ensure that the upper-layer services have been stopped. If not, perform operations by referring to :ref:`Performing an HBase DR Service Switchover `. + +**Disabling the write function of the active cluster** + +2. Download and install the HBase client. + +3. On the HBase client of the standby cluster, run the following command as user **hbase** to disable the data write function of the standby cluster: + + **kinit hbase** + + **hbase shell** + + **set_clusterState_standby** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_standby + => true + +**Checking whether the active/standby synchronization is complete** + +4. Run the following command to ensure that the current data has been synchronized (SizeOfLogQueue=0 and SizeOfLogToReplicate=0 are required). If the values are not 0, wait and run the following command repeatedly until the values are 0. + + **status 'replication'** + +**Disabling synchronization between the active and standby clusters** + +5. Query all synchronization clusters and obtain the value of **PEER_ID**. + + **list_peers** + +6. Delete all synchronization clusters. + + **remove_peer** *'Standby cluster ID'* + + Example: + + **remove_peer** **'**\ *\ 1\ *\ **'** + +7. Query all synchronized tables. + + **list_replicated_tables** + +8. Disable all synchronized tables queried in the preceding step. + + **disable_table_replication** *'Table name'* + + Example: + + **disable_table_replication** *'t1'* + +**Performing an active/standby switchover** + +9. Reconfigure HBase DR. For details, see :ref:`Configuring HBase DR `. diff --git a/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_service_switchover.rst b/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_service_switchover.rst new file mode 100644 index 0000000..bd6c177 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/performing_an_hbase_dr_service_switchover.rst @@ -0,0 +1,84 @@ +:original_name: mrs_01_1610.html + +.. _mrs_01_1610: + +Performing an HBase DR Service Switchover +========================================= + +Scenario +-------- + +The system administrator can configure HBase cluster DR to improve system availability. If the active cluster in the DR environment is faulty and the connection to the HBase upper-layer application is affected, you need to configure the standby cluster information for the HBase upper-layer application so that the application can run in the standby cluster. + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. + +Impact on the System +-------------------- + +After a service switchover, data written to the standby cluster is not synchronized to the active cluster by default. Add the active cluster is recovered, the data newly generated in the standby cluster needs to be synchronized to the active cluster by backup and recovery. If automatic data synchronization is required, you need to switch over the active and standby HBase DR clusters. + +Procedure +--------- + +#. Log in to FusionInsight Manager of the standby cluster. + +#. Download and install the HBase client. + +#. 
On the HBase client of the standby cluster, run the following command as user **hbase** to enable the data writing status in the standby cluster. + + **kinit hbase** + + **hbase shell** + + **set_clusterState_active** + + The command is run successfully if the following information is displayed: + + .. code-block:: + + hbase(main):001:0> set_clusterState_active + => true + +#. Check whether the original configuration files **hbase-site.xml**, **core-site.xml**, and **hdfs-site.xml** of the HBase upper-layer application are modified to adapt to the application running. + + - If yes, update the related content to the new configuration file and replace the old configuration file. + - If no, use the new configuration file to replace the original configuration file of the HBase upper-layer application. + +#. Configure the network connection between the host where the HBase upper-layer application is located and the standby cluster. + + .. note:: + + If the host where the client is installed is not a node in the cluster, configure network connections for the client to prevent errors when you run commands on the client. + + a. Ensure that the host where the client is installed can communicate with the hosts listed in the **hosts** file in the directory where the client installation package is decompressed. + b. If the host where the client is located is not a node in the cluster, you need to set the mapping between the host name and the IP address (service plan) in the /etc/hosts file on the host. The host names and IP addresses must be mapped one by one. + +#. Set the time of the host where the HBase upper-layer application is located to be the same as that of the standby cluster. The time difference must be less than 5 minutes. + +#. Check the authentication mode of the active cluster. + + - If the security mode is used, go to :ref:`8 `. + - If the normal mode is used, no further action is required. + +#. .. _mrs_01_1610__l5002f6a291d5455895e03939d56eae5c: + + Obtain the **keytab** and **krb5.conf** configuration files of the HBase upper-layer application user. + + a. On FusionInsight Manager of the standby cluster, choose **System** > **Permission** > **User**. + b. Locate the row that contains the target user, click **More** > **Download Authentication Credential** in the **Operation** column, and download the **keytab** file to the local PC. + c. Decompress the package to obtain **user.keytab** and **krb5.conf**. + +#. Use the **user.keytab** and **krb5.conf** files to replace the original files in the HBase upper-layer application. + +#. Stop upper-layer applications. + +#. Determine whether to switch over the active and standby HBase clusters. If the switchover is not performed, data will not be synchronized. + + - If yes, switch over the active and standby HBase DR clusters. For details, see :ref:`Performing an HBase DR Active/Standby Cluster Switchover `. Then, go to :ref:`12 `. + - If no, go to :ref:`12 `. + +#. .. _mrs_01_1610__li11189185214483: + + Start the upper-layer services. diff --git a/doc/component-operation-guide/source/using_hbase/using_an_hbase_client.rst b/doc/component-operation-guide/source/using_hbase/using_an_hbase_client.rst new file mode 100644 index 0000000..02d2929 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/using_an_hbase_client.rst @@ -0,0 +1,103 @@ +:original_name: bakmrs_01_0368.html + +.. 
_bakmrs_01_0368: + +Using an HBase Client +===================== + +Scenario +-------- + +This section describes how to use the HBase client in an O&M scenario or a service scenario. + +Prerequisites +------------- + +- The client has been installed. For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. + + A machine-machine user needs to download the **keytab** file and a human-machine user needs to change the password upon the first login. + +- If a non-**root** user uses the HBase client, ensure that the owner of the HBase client directory is this user. Otherwise, run the following command to change the owner. + + **chown user:group -R** *Client installation directory*\ **/HBase** + +Using the HBase Client (Versions Earlier Than MRS 3.x) +------------------------------------------------------ + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Component service user* + + For example, **kinit hbaseuser**. + +#. Run the following HBase client command: + + **hbase shell** + +Using the HBase Client (MRS 3.x or Later) +----------------------------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client directory: + + **cd** **/opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If you use the client to connect to a specific HBase instance in a scenario where multiple HBase instances are installed, run the following command to load the environment variables of the instance. Otherwise, skip this step. For example, to load the environment variables of the HBase2 instance, run the following command: + + **source HBase2/component_env** + +#. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *Component service user* + + For example, **kinit hbaseuser**. + +#. Run the following HBase client command: + + **hbase shell** + +Common HBase client commands +---------------------------- + +The following table lists common HBase client commands. + +.. 
table:: **Table 1** HBase client commands + + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Command | Description | + +==========+=================================================================================================================================================================================================================================+ + | create | Used to create a table, for example, **create 'test', 'f1', 'f2', 'f3'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | disable | Used to disable a specified table, for example, **disable 'test'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | enable | Used to enable a specified table, for example, **enable 'test'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | alter | Used to alter the table structure. You can run the **alter** command to add, modify, or delete column family information and table-related parameter values, for example, **alter 'test', {NAME => 'f3', METHOD => 'delete'}**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | describe | Used to obtain the table description, for example, **describe 'test'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | drop | Used to delete a specified table, for example, **drop 'test'**. Before deleting a table, you must stop it. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | put | Used to write the value of a specified cell, for example, **put 'test','r1','f1:c1','myvalue1'**. The cell location is unique and determined by the table, row, and column. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | get | Used to get the value of a row or the value of a specified cell in a row, for example, **get 'test','r1'**. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | scan | Used to query table data, for example, **scan 'test'**. 
The table name and scanner must be specified in the command. | + +----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hbase/using_hbase_from_scratch.rst b/doc/component-operation-guide/source/using_hbase/using_hbase_from_scratch.rst new file mode 100644 index 0000000..0c67283 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/using_hbase_from_scratch.rst @@ -0,0 +1,256 @@ +:original_name: mrs_01_0368.html + +.. _mrs_01_0368: + +Using HBase from Scratch +======================== + +HBase is a column-based distributed storage system that features high reliability, performance, and scalability. This section describes how to use HBase from scratch, including how to update the client on the Master node in the cluster, create a table using the client, insert data in the table, modify the table, read data from the table, delete table data, and delete the table. + +Background +---------- + +Suppose a user develops an application to manage users who use service A in an enterprise. The procedure of operating service A on the HBase client is as follows: + +- Create the **user_info** table. +- Add users' educational backgrounds and titles to the table. +- Query user names and addresses by user ID. +- Query information by user name. +- Deregister users and delete user data from the user information table. +- Delete the user information table after service A ends. + +.. _mrs_01_0368__en-us_topic_0229422393_en-us_topic_0173178212_en-us_topic_0037446806_table27353390: + +.. table:: **Table 1** User information + + =========== ==== ====== === ======= + ID Name Gender Age Address + =========== ==== ====== === ======= + 12005000201 A Male 19 City A + 12005000202 B Female 23 City B + 12005000203 C Male 26 City C + 12005000204 D Male 18 City D + 12005000205 E Female 21 City E + 12005000206 F Male 32 City F + 12005000207 G Female 29 City G + 12005000208 H Female 30 City H + 12005000209 I Male 26 City I + 12005000210 J Male 25 City J + =========== ==== ====== === ======= + +Prerequisites +------------- + +The client has been installed. For example, the client is installed in the **/opt/client** directory. The client directory in the following operations is only an example. Change it to the actual installation directory. Before using the client, download and update the client configuration file, and ensure that the active management node of Manager is available. + +Procedure +--------- + +For versions earlier than MRS 3.x, perform the following operations: + +#. .. _mrs_01_0368__en-us_topic_0229422393_en-us_topic_0173178212_l6b58a848ef0f4fe6a361d4ef0ac39fb8: + + Download the client configuration file. + + a. Log in to MRS Manager. For details, see :ref:`Accessing Manager `. Then, choose **Services**. + + b. Click **Download Client**. + + Set **Client Type** to **Only configuration files**, **Download To** to **Server**, and click **OK** to generate the client configuration file. The generated file is saved in the **/tmp/MRS-client** directory on the active management node by default. You can customize the file path. + +#. Log in to the active management node of MRS Manager. + + a. On the **Node** tab page, view the **Name** parameter. The node that contains **master1** in its name is the Master1 node. 
The node that contains **master2** in its name is the Master2 node. + + The active and standby management nodes of MRS Manager are installed on Master nodes by default. Because Master1 and Master2 are switched over in active and standby mode, Master1 is not always the active management node of MRS Manager. Run a command in Master1 to check whether Master1 is active management node of MRS Manager. For details about the command, see :ref:`2.d `. + + b. Log in to the Master1 node using the password as user **root**. For details, see `Logging In to an ECS `__. + + c. Run the following commands to switch to user **omm**: + + **sudo su - root** + + **su - omm** + + d. .. _mrs_01_0368__en-us_topic_0229422393_en-us_topic_0173178212_le8e7045cece741e8b6209b929a50ff22: + + Run the following command to check the active management node of MRS Manager: + + **sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh** + + In the command output, the node whose **HAActive** is **active** is the active management node, and the node whose **HAActive** is **standby** is the standby management node. In the following example, **mgtomsdat-sh-3-01-1** is the active management node, and **mgtomsdat-sh-3-01-2** is the standby management node. + + .. code-block:: + + Ha mode + double + NodeName HostName HAVersion StartTime HAActive HAAllResOK HARunPhase + 192-168-0-30 mgtomsdat-sh-3-01-1 V100R001C01 2019-11-18 23:43:02 active normal Actived + 192-168-0-24 mgtomsdat-sh-3-01-2 V100R001C01 2019-11-21 07:14:02 standby normal Deactived + + e. Log in to the active management node, for example, **192-168-0-30** of MRS Manager as user **root**, and run the following command to switch to user **omm**: + + **sudo su - omm** + +#. Run the following command to switch to the client installation directory, for example, **/opt/client**: + + **cd /opt/client** + +#. .. _mrs_01_0368__en-us_topic_0229422393_en-us_topic_0173178212_lc39cdd52f6ac479ab273ecabbffd083b: + + Run the following command to update the client configuration for the active management node. + + **sh refreshConfig.sh /opt/client** *Full path of the client configuration file package* + + For example, run the following command: + + **sh refreshConfig.sh /opt/client /tmp/MRS-client/MRS_Services_Client.tar** + + If the following information is displayed, the configurations have been updated successfully. + + .. code-block:: + + ReFresh components client config is complete. + Succeed to refresh components client config. + + .. note:: + + You can refer to steps :ref:`1 ` to :ref:`4 ` or Method 2 in `Updating a Client `__. + +#. Use the client on a Master node. + + a. On the active management node where the client is updated, for example, node **192-168-0-30**, run the following command to go to the client directory: + + **cd /opt/client** + + b. Run the following command to configure environment variables: + + **source bigdata_env** + + c. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + For example, **kinit hbaseuser**. + + d. Run the following HBase client command: + + **hbase shell** + +#. Run the following commands on the HBase client to implement service A. + + a. Create the **user_info** user information table according to :ref:`Table 1 ` and add data to it. 
+ + **create** '*user_info*',{**NAME** => 'i'} + + For example, to add information about the user whose ID is 12005000201, run the following commands: + + **put** '*user_info*','*12005000201*','**i:name**','*A*' + + **put** '*user_info*','*12005000201*','**i:gender**','*Male*' + + **put** '*user_info*','*12005000201*','**i:age**','*19*' + + **put** '*user_info*','*12005000201*','**i:address**','*City A*' + + b. Add users' educational backgrounds and titles to the **user_info** table. + + For example, to add educational background and title information about user 12005000201, run the following commands: + + **put** '*user_info*','*12005000201*','**i:degree**','*master*' + + **put** '*user_info*','*12005000201*','**i:pose**','*manager*' + + c. Query user names and addresses by user ID. + + For example, to query the name and address of user 12005000201, run the following command: + + **scan**'*user_info*',{**STARTROW**\ =>'*12005000201*',\ **STOPROW**\ =>'*12005000201*',\ **COLUMNS**\ =>['**i:name**','**i:address**']} + + d. Query information by user name. + + For example, to query information about user A, run the following command: + + **scan**'*user_info*',{**FILTER**\ =>"SingleColumnValueFilter('i','name',=,'binary:*A*')"} + + e. Delete user data from the user information table. + + All user data needs to be deleted. For example, to delete data of user 12005000201, run the following command: + + **delete**'*user_info*','*12005000201*','i' + + f. Delete the user information table. + + **disable**'*user_info*' + + **drop** '*user_info*' + +For MRS 3.x or later, perform the following operations: + +#. Use the client on the active management node. + + a. Log in to the node where the client is installed as the client installation user and run the following command to switch to the client directory: + + **cd /opt/client** + + b. Run the following command to configure environment variables: + + **source bigdata_env** + + c. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permission to create HBase tables. If Kerberos authentication is disabled for the current cluster, skip this step. + + **kinit** *MRS cluster user* + + For example, **kinit hbaseuser**. + + d. Run the following HBase client command: + + **hbase shell** + +#. Run the following commands on the HBase client to implement service A. + + a. Create the **user_info** user information table according to :ref:`Table 1 ` and add data to it. + + **create** '*user_info*',{**NAME** => 'i'} + + For example, to add information about the user whose ID is **12005000201**, run the following commands: + + **put** '*user_info*','*12005000201*','**i:name**','*A*' + + **put** '*user_info*','*12005000201*','**i:gender**','*Male*' + + **put** '*user_info*','*12005000201*','**i:age**','*19*' + + **put** '*user_info*','*12005000201*','**i:address**','*City A*' + + b. Add users' educational backgrounds and titles to the **user_info** table. + + For example, to add educational background and title information about user 12005000201, run the following commands: + + **put** '*user_info*','*12005000201*','**i:degree**','*master*' + + **put** '*user_info*','*12005000201*','**i:pose**','*manager*' + + c. Query user names and addresses by user ID. 
+ + For example, to query the name and address of user 12005000201, run the following command: + + **scan**'*user_info*',{**STARTROW**\ =>'*12005000201*',\ **STOPROW**\ =>'*12005000201*',\ **COLUMNS**\ =>['**i:name**','**i:address**']} + + d. Query information by user name. + + For example, to query information about user A, run the following command: + + **scan**'*user_info*',{**FILTER**\ =>"SingleColumnValueFilter('i','name',=,'binary:*A*')"} + + e. Delete user data from the user information table. + + All user data needs to be deleted. For example, to delete data of user 12005000201, run the following command: + + **delete**'*user_info*','*12005000201*','i' + + f. Delete the user information table. + + **disable**'*user_info*' + + **drop** '*user_info*' diff --git a/doc/component-operation-guide/source/using_hbase/using_the_replicationsyncup_tool.rst b/doc/component-operation-guide/source/using_hbase/using_the_replicationsyncup_tool.rst new file mode 100644 index 0000000..e072471 --- /dev/null +++ b/doc/component-operation-guide/source/using_hbase/using_the_replicationsyncup_tool.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_0510.html + +.. _mrs_01_0510: + +Using the ReplicationSyncUp Tool +================================ + +Prerequisites +------------- + +#. Active and standby clusters have been installed and started. +#. Time is consistent between the active and standby clusters and the NTP service on the active and standby clusters uses the same time source. +#. When the HBase service of the active cluster is stopped, the ZooKeeper and HDFS services must be started and run. +#. ReplicationSyncUp must be run by the system user who starts the HBase process. +#. In security mode, ensure that the HBase system user of the standby cluster has the read permission on HDFS of the active cluster. This is because that it will update the ZooKeeper nodes and HDFS files of the HBase system. +#. When HBase of the active cluster is faulty, the ZooKeeper, file system, and network of the active cluster are still available. + +Scenarios +--------- + +The replication mechanism can use WAL to synchronize the state of a cluster with the state of another cluster. After HBase replication is enabled, if the active cluster is faulty, ReplicationSyncUp synchronizes incremental data from the active cluster to the standby cluster using the information from the ZooKeeper node. After data synchronization is complete, the standby cluster can be used as an active cluster. + +Parameter Configuration +----------------------- + ++------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ +| Parameter | Description | Default Value | ++====================================+=========================================================================================================================================================================================================+===============+ +| hbase.replication.bulkload.enabled | Whether to enable the bulkload data replication function. The parameter value type is Boolean. To enable the bulkload data replication function, set this parameter to **true** for the active cluster. 
| **false** | ++------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ +| hbase.replication.cluster.id | ID of the source HBase cluster. After the bulkload data replication is enabled, this parameter is mandatory and must be defined in the source cluster. The parameter value type is String. | ``-`` | ++------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ + +Tool Usage +---------- + +Run the following command on the client of the active cluster: + +**hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp -Dreplication.sleep.before.failover=1** + +.. note:: + + **replication.sleep.before.failover** indicates sleep time required for replication of the remaining data when RegionServer fails to start. You are advised to set this parameter to 1 second to quickly trigger replication. + +Precautions +----------- + +#. When the active cluster is stopped, this tool obtains the WAL processing progress and WAL processing queue from the ZooKeeper Node (RS znode) and copies the queues that are not copied to the standby cluster. +#. RegionServer of each active cluster has its own znode under the replication node of ZooKeeper in the standby cluster. It contains one znode of each peer cluster. +#. If RegionServer is faulty, each RegionServer in the active cluster receives a notification through the watcher and attempts to lock the znode of the faulty RegionServer, including its queues. The successfully created RegionServer transfers all queues to the znode of its own queue. After queues are transferred, they are deleted from the old location. +#. When the active cluster is stopped, ReplicationSyncUp synchronizes data between active and standby clusters using the information from the ZooKeeper node. In addition, WALs of the RegionServer znode will be moved to the standby cluster. + +Restrictions and Limitations +---------------------------- + +If the standby cluster is stopped or the peer relationship is closed, the tool runs normally but the peer relationship cannot be replicated. diff --git a/doc/component-operation-guide/source/using_hdfs/balancing_datanode_capacity.rst b/doc/component-operation-guide/source/using_hdfs/balancing_datanode_capacity.rst new file mode 100644 index 0000000..b9558f5 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/balancing_datanode_capacity.rst @@ -0,0 +1,188 @@ +:original_name: mrs_01_1667.html + +.. _mrs_01_1667: + +Balancing DataNode Capacity +=========================== + +Scenario +-------- + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. + +In the HDFS cluster, unbalanced disk usage among DataNodes may occur, for example, when new DataNodes are added to the cluster. Unbalanced disk usage may result in multiple problems. For example, MapReduce applications cannot make full use of local computing advantages, network bandwidth usage between data nodes cannot be optimal, or node disks cannot be used. Therefore, the system administrator needs to periodically check and maintain DataNode data balance. + +HDFS provides a capacity balancing program Balancer. 
By running Balancer, you can balance the HDFS cluster and ensure that the difference between the disk usage of each DataNode and that of the HDFS cluster does not exceed the threshold. DataNode disk usage before and after balancing is shown in :ref:`Figure 1 ` and :ref:`Figure 2 `, respectively. + +.. _mrs_01_1667__ff269b77c9222460985503fffcebf980e: + +.. figure:: /_static/images/en-us_image_0000001349090497.png + :alt: **Figure 1** DataNode disk usage before balancing + + **Figure 1** DataNode disk usage before balancing + +.. _mrs_01_1667__fee19cefb9d104f238448abdcf62f1e49: + +.. figure:: /_static/images/en-us_image_0000001296250300.png + :alt: **Figure 2** DataNode disk usage after balancing + + **Figure 2** DataNode disk usage after balancing + +The time of the balancing operation is affected by the following two factors: + +#. Total amount of data to be migrated: + + The data volume of each DataNode must be greater than (Average usage - Threshold) x Average data volume and less than (Average usage + Threshold) x Average data volume. If the actual data volume is less than the minimum value or greater than the maximum value, imbalance occurs. The system sets the largest deviation volume on all DataNodes as the total data volume to be migrated. + +#. Balancer migration is performed in sequence in iteration mode. The amount of data to be migrated in each iteration does not exceed 10 GB, and the usage of each iteration is recalculated. + +Therefore, for a cluster, you can estimate the time consumed by each iteration (by observing the time consumed by each iteration recorded in balancer logs) and divide the total data volume by 10 GB to estimate the task execution time. + +The balancer can be started or stopped at any time. + +Impact on the System +-------------------- + +- The balance operation occupies network bandwidth resources of DataNodes. Perform the operation during maintenance based on service requirements. +- The balance operation may affect the running services if the bandwidth traffic (the default bandwidth control is 20 MB/s) is reset or the data volume is increased. + +Prerequisites +------------- + +The client has been installed. + +Procedure +--------- + +#. Log in to the node where the client is installed as a client installation user. Run the following command to switch to the client installation directory, for example, **/opt/client**: + + **cd /opt/client** + + .. note:: + + If the cluster is in normal mode, run the **su - omm** command to switch to user **omm**. + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the HDFS identity: + + **kinit hdfs** + +#. Determine whether to adjust the bandwidth control. + + - If yes, go to :ref:`5 `. + - If no, go to :ref:`6 `. + +#. .. _mrs_01_1667__l91d088e58d8d4bbbb720317d843ca5d3: + + Run the following command to change the maximum bandwidth of Balancer, and then go to :ref:`6 `. + + **hdfs dfsadmin -setBalancerBandwidth ** + + ** indicates the bandwidth control value, in bytes. For example, to set the bandwidth control to 20 MB/s (the corresponding value is 20971520), run the following command: + + **hdfs dfsadmin -setBalancerBandwidth** **20971520** + + .. note:: + + - The default bandwidth control is 20 MB/s. This value is applicable to the scenario where the current cluster uses the 10GE network and services are being executed. 
If the service idle time window is insufficient for balance maintenance, you can increase the value of this parameter to shorten the balance time, for example, to 209715200 (200 MB/s). + - The value of this parameter depends on the networking. If the cluster load is high, you can change the value to 209715200 (200 MB/s). If the cluster is idle, you can change the value to 1073741824 (1 GB/s). + - If the bandwidth of the DataNodes cannot reach the specified maximum bandwidth, modify the HDFS parameter **dfs.datanode.balance.max.concurrent.moves** on FusionInsight Manager, and change the number of threads for balancing on each DataNode to **32** and restart the HDFS service. + +#. .. _mrs_01_1667__ld8ed77b8b7c745308eea6a68de2f4233: + + Run the following command to start the balance task: + + **bash /opt/client/HDFS/hadoop/sbin/start-balancer.sh -threshold ** + + **-threshold** specifies the deviation value of the DataNode disk usage, which is used for determining whether the HDFS data is balanced. When the difference between the disk usage of each DataNode and the average disk usage of the entire HDFS cluster is less than this threshold, the system considers that the HDFS cluster has been balanced and ends the balance task. + + For example, to set deviation rate to 5%, run the following command: + + **bash /opt/client/HDFS/hadoop/sbin/start-balancer.sh -threshold 5** + + .. note:: + + - The preceding command executes the task in the background. You can query related logs in the **hadoop-root-balancer-**\ *host name*\ **.out log** file in the **/opt/client/HDFS/hadoop/logs** directory of the host. + + - To stop the balance task, run the following command: + + **bash /opt/client/HDFS/hadoop/sbin/stop-balancer.sh** + + - If only data on some nodes needs to be balanced, you can add the **-include** parameter in the script to specify the nodes to be migrated. You can run commands to view the usage of different parameters. + + - **/opt/client** is the client installation directory. If the directory is inconsistent, replace it. + + - If the command fails to be executed and the error information **Failed to APPEND_FILE /system/balancer.id** is displayed in the log, run the following command to forcibly delete **/system/balancer.id** and run the **start-balancer.sh** script again: + + **hdfs dfs -rm -f /system/balancer.id** + +#. After you run the script in :ref:`6 `, the **hadoop-root-balancer-**\ *Host name*\ **.out log** file is generated in **/opt/client/HDFS/hadoop/logs**, the client installation directory. You can view the following information in the log: + + - Time Stamp + - Bytes Already Moved + - Bytes Left To Move + - Bytes Being Moved + + If message "Balance took *xxx* seconds" is displayed in the log, the balancing operation is complete. + +Related Tasks +------------- + +**Enable automatic execution of the balance task** + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations**, select **All Configurations**, search for the following parameters, and change the parameter values. + + - **dfs.balancer.auto.enable** indicates whether to enable automatic balance task execution. The default value **false** indicates that automatic balance task execution is disabled. The value **true** indicates that automatic execution is enabled. + + - **dfs.balancer.auto.cron.expression** indicates the task execution time. The default value **0 1 \* \* 6** indicates that the task is executed at 01:00 every Saturday. 
This parameter is valid only when the automatic execution is enabled. + + :ref:`Table 1 ` describes the expression for modifying this parameter. **\*** indicates consecutive time segments. + + .. _mrs_01_1667__t3d64fdb3254a42beaed3c5e4c7087501: + + .. table:: **Table 1** Parameters in the execution expression + + ====== =========================================================== + Column Description + ====== =========================================================== + 1 Minute. The value ranges from 0 to 59. + 2 Hour. The value ranges from 0 to 23. + 3 Date. The value ranges from 1 to 31. + 4 Month. The value ranges from 1 to 12. + 5 Week. The value ranges from 0 to 6. **0** indicates Sunday. + ====== =========================================================== + + - **dfs.balancer.auto.stop.cron.expression** indicates the task ending time. The default value is empty, indicating that the running balance task is not automatically stopped. For example, **0 5 \* \* 6** indicates that the balance task is stopped at 05:00 every Saturday. This parameter is valid only when the automatic execution is enabled. + + :ref:`Table 1 ` describes the expression for modifying this parameter. **\*** indicates consecutive time segments. + +#. Running parameters of the balance task that is automatically executed are shown in :ref:`Table 2 `. + + .. _mrs_01_1667__tc3bff391b0c14479916d9097f5e28238: + + .. table:: **Table 2** Running parameters of the automatic balancer + + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | Parameter | Parameter description | Default Value | + +=====================================+===========================================================================================================================================================================================================================================================================================================================================================================+=====================================+ + | dfs.balancer.auto.threshold | Specifies the balancing threshold of the disk capacity percentage. This parameter is valid only when **dfs.balancer.auto.enable** is set to **true**. | 10 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | dfs.balancer.auto.exclude.datanodes | Specifies the list of DataNodes on which automatic disk balancing is not required. This parameter is valid only when **dfs.balancer.auto.enable** is set to **true**. | The value is left blank by default. 
| + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | dfs.balancer.auto.bandwidthPerSec | Specifies the maximum bandwidth (MB/s) of each DataNode for load balancing. | 20 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | dfs.balancer.auto.maxIdleIterations | Specifies the maximum number of consecutive idle iterations of Balancer. An idle iteration is an iteration without moving blocks. When the number of consecutive idle iterations reaches the maximum number, the balance task ends. The value **-1** indicates infinity. | 5 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + | dfs.balancer.auto.maxDataNodesNum | Controls the number of DataNodes that perform automatic balance tasks. Assume that the value of this parameter is *N*. If *N* is greater than 0, data is balanced between *N* DataNodes with the highest percentage of remaining space and *N* DataNodes with the lowest percentage of remaining space. If *N* is 0, data is balanced among all DataNodes in the cluster. | 5 | + +-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------+ + +#. Click **Save** to make configurations take effect. You do not need to restart the HDFS service. + + Go to the **/var/log/Bigdata/hdfs/nn/hadoop-omm-balancer-**\ *Host name*\ **.log** file to view the task execution logs saved in the active NameNode. diff --git a/doc/component-operation-guide/source/using_hdfs/changing_the_datanode_storage_directory.rst b/doc/component-operation-guide/source/using_hdfs/changing_the_datanode_storage_directory.rst new file mode 100644 index 0000000..9a0b9b6 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/changing_the_datanode_storage_directory.rst @@ -0,0 +1,175 @@ +:original_name: mrs_01_1664.html + +.. _mrs_01_1664: + +Changing the DataNode Storage Directory +======================================= + +Scenario +-------- + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. 
+ +If the storage directory defined by the HDFS DataNode is incorrect or the HDFS storage plan changes, the system administrator needs to modify the DataNode storage directory on FusionInsight Manager to ensure that the HDFS works properly. Changing the DataNode storage directory includes the following scenarios: + +- Change the storage directory of the DataNode role. In this way, the storage directories of all DataNode instances are changed. +- Change the storage directory of a single DataNode instance. In this way, only the storage directory of this instance is changed, and the storage directories of other instances remain the same. + +Impact on the System +-------------------- + +- The HDFS service needs to be stopped and restarted during the process of changing the storage directory of the DataNode role, and the cluster cannot provide services before it is completely started. + +- The DataNode instance needs to be stopped and restarted during the process of changing the storage directory of the instance, and the instance at this node cannot provide services before it is started. +- The directory for storing service parameter configurations must also be updated. + +Prerequisites +------------- + +- New disks have been prepared and installed on each data node, and the disks are formatted. + +- New directories have been planned for storing data in the original directories. +- The HDFS client has been installed. +- The system administrator user **hdfs** is available. +- When changing the storage directory of a single DataNode instance, ensure that the number of active DataNode instances is greater than the value of **dfs.replication**. + +Procedure +--------- + +**Check the environment.** + +#. Log in to the server where the HDFS client is installed as user **root**, and run the following command to configure environment variables: + + **source** *Installation directory of the HDFS client*\ **/bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the user: + + **kinit hdfs** + +#. Run the following command on the HDFS client to check whether all directories and files in the HDFS root directory are normal: + + **hdfs fsck /** + + Check the fsck command output. + + - If the following information is displayed, no file is lost or damaged. Go to :ref:`4 `. + + .. code-block:: + + The filesystem under path '/' is HEALTHY + + - If other information is displayed, some files are lost or damaged. Go to :ref:`5 `. + +#. .. _mrs_01_1664__le587d508c49b4837bcabd9bd9cf98bc4: + + Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services**, and check whether **Running Status** of HDFS is **Normal**. + + - If yes, go to :ref:`6 `. + - If no, the HDFS status is unhealthy. Go to :ref:`5 `. + +#. .. _mrs_01_1664__l1ce08f0a7d2349b487dd6f19c38c7273: + + Rectify the HDFS fault. The task is complete. + +#. .. _mrs_01_1664__lff55f0ef8699449ab4cfc4eddeed1711: + + Determine whether to change the storage directory of the DataNode role or that of a single DataNode instance: + + - To change the storage directory of the DataNode role, go to :ref:`7 `. + - To change the storage directory of a single DataNode instance, go to :ref:`12 `. + +**Changing the storage directory of the DataNode role** + +7. .. _mrs_01_1664__l4bc534684e1d4d3cb656e4ed55bb75af: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Stop Instance** to stop the HDFS service. + +8.
Log in to each data node where the HDFS service is installed as user **root** and perform the following operations: + + a. Create a target directory (**data1** and **data2** are original directories in the cluster). + + For example, to create a target directory **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following command: + + **mkdir** **-p ${BIGDATA_DATA_HOME}/hadoop/data3/dn** + + b. Mount the target directory to the new disk. For example, mount **${BIGDATA_DATA_HOME}/hadoop/data3** to the new disk. + + c. Modify permissions on the new directory. + + For example, to create a target directory **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following commands: + + **chmod 700** **${BIGDATA_DATA_HOME}/hadoop/data3/dn -R** and **chown omm:wheel** **${BIGDATA_DATA_HOME}/hadoop/data3/dn -R** + + d. .. _mrs_01_1664__l63f4856203e9425f9a23113c3d13f665: + + Copy the data to the target directory. + + For example, if the old directory is **${BIGDATA_DATA_HOME}/hadoop/data1/dn** and the target directory is **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following command: + + **cp -af** **${BIGDATA_DATA_HOME}/hadoop/data1/dn/\*** **${BIGDATA_DATA_HOME}/hadoop/data3/dn** + +9. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations** > **All Configurations** to go to the HDFS service configuration page. + + Change the value of **dfs.datanode.data.dir** from the default value **%{@auto.detect.datapart.dn}** to the new target directory, for example, **${BIGDATA_DATA_HOME}/hadoop/data3/dn**. + + For example, the original data storage directories are **/srv/BigData/hadoop/data1**, **/srv/BigData/hadoop/data2**. To migrate data from the **/srv/BigData/hadoop/data1** directory to the newly created **/srv/BigData/hadoop/data3** directory, replace the whole parameter with **/srv/BigData/hadoop/data2, /srv/BigData/hadoop/data3**. Separate multiple storage directories with commas (,). In this example, changed directories are **/srv/BigData/hadoop/data2**, **/srv/BigData/hadoop/data3**. + +10. Click **Save**. Choose **Cluster** > *Name of the desired cluster* > **Services**. On the page that is displayed, start the services that have been stopped. + +11. After the HDFS is started, run the following command on the HDFS client to check whether all directories and files in the HDFS root directory are correctly copied: + + **hdfs fsck /** + + Check the fsck command output. + + - If the following information is displayed, no file is lost or damaged, and data replication is successful. No further action is required. + + .. code-block:: + + The filesystem under path '/' is HEALTHY + + - If other information is displayed, some files are lost or damaged. In this case, check whether :ref:`8.d ` is correct and run the **hdfs fsck** *Name of the damaged file* **-delete** command. + +**Changing the storage directory of a single DataNode instance** + +12. .. _mrs_01_1664__lab34cabb4d324166acebeb18e1098884: + + Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Instance**. Select the HDFS instance whose storage directory needs to be modified, and choose **More** > **Stop Instance**. + +13. Log in to the DataNode node as user **root**, and perform the following operations: + + a. Create a target directory. + + For example, to create a target directory **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following command: + + **mkdir -p** **${BIGDATA_DATA_HOME}/hadoop/data3/dn** + + b. 
Mount the target directory to the new disk. + + For example, mount **${BIGDATA_DATA_HOME}/hadoop/data3** to the new disk. + + c. Modify permissions on the new directory. + + For example, to create a target directory **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following commands: + + **chmod 700** **${BIGDATA_DATA_HOME}/hadoop/data3/dn -R** and **chown omm:wheel** **${BIGDATA_DATA_HOME}/hadoop/data3/dn -R** + + d. Copy the data to the target directory. + + For example, if the old directory is **${BIGDATA_DATA_HOME}/hadoop/data1/dn** and the target directory is **${BIGDATA_DATA_HOME}/hadoop/data3/dn**, run the following command: + + **cp -af** **${BIGDATA_DATA_HOME}/hadoop/data1/dn/\*** **${BIGDATA_DATA_HOME}/hadoop/data3/dn** + +14. On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Service** > **HDFS** > **Instance**. Click the specified DataNode instance and go to the **Configurations** page. + + Change the value of **dfs.datanode.data.dir** from the default value **%{@auto.detect.datapart.dn}** to the new target directory, for example, **${BIGDATA_DATA_HOME}/hadoop/data3/dn**. + + For example, the original data storage directories are **/srv/BigData/hadoop/data1,/srv/BigData/hadoop/data2**. To migrate data from the **/srv/BigData/hadoop/data1** directory to the newly created **/srv/BigData/hadoop/data3** directory, replace the whole parameter with **/srv/BigData/hadoop/data2,/srv/BigData/hadoop/data3**. + +15. Click **Save**, and then click **OK**. + + **Operation succeeded** is displayed. click **Finish**. + +16. Choose **More** > **Restart Instance** to restart the DataNode instance. diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_encrypted_channels.rst b/doc/component-operation-guide/source/using_hdfs/configuring_encrypted_channels.rst new file mode 100644 index 0000000..eb6cabf --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_encrypted_channels.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_0810.html + +.. _mrs_01_0810: + +Configuring Encrypted Channels +============================== + +Scenario +-------- + +Encrypted channel is an encryption protocol of remote procedure call (RPC) in HDFS. When a user invokes RPC, the user's login name will be transmitted to RPC through RPC head. Then RPC uses Simple Authentication and Security Layer (SASL) to determine an authorization protocol (Kerberos and DIGEST-MD5) to complete RPC authorization. When users deploy security clusters, they need to use encrypted channels and configure the following parameters. For details about the secure Hadoop RPC, visit https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/SecureMode.html#Data_Encryption_on_RPC. + +Configuration Description +------------------------- + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. 
table:: **Table 1** Parameter description + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ + | Parameter | Description | Default Value | + +=======================+==========================================================================================================================================================================================================================+================================+ + | hadoop.rpc.protection | .. important:: | - Security mode: privacy | + | | | - Normal mode: authentication | + | | NOTICE: | | + | | | | + | | - The setting takes effect only after the service is restarted. Rolling restart is not supported. | | + | | - After the setting, you need to download the client configuration again. Otherwise, the HDFS cannot provide the read and write services. | | + | | | | + | | Whether the RPC channels of each module in Hadoop are encrypted. The channels include: | | + | | | | + | | - RPC channels for clients to access HDFS | | + | | - RPC channels between modules in HDFS, for example, RPC channels between DataNode and NameNode | | + | | - RPC channels for clients to access Yarn | | + | | - RPC channels between NodeManager and ResourceManager | | + | | - RPC channels for Spark to access Yarn and HDFS | | + | | - RPC channels for MapReduce to access Yarn and HDFS | | + | | - RPC channels for HBase to access HDFS | | + | | | | + | | .. note:: | | + | | | | + | | You can set this parameter on the HDFS component configuration page. The parameter setting takes effect globally, that is, the setting of whether the RPC channel is encrypted takes effect on all modules in Hadoop. | | + | | | | + | | There are three encryption modes. | | + | | | | + | | - **authentication**: This is the default value in normal mode. In this mode, data is directly transmitted without encryption after being authenticated. This mode ensures performance but has security risks. | | + | | - **integrity**: Data is transmitted without encryption or authentication. To ensure data security, exercise caution when using this mode. | | + | | - **privacy**: This is the default value in security mode, indicating that data is transmitted after authentication and encryption. This mode reduces the performance. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_directory_permission.rst b/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_directory_permission.rst new file mode 100644 index 0000000..324b156 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_directory_permission.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0797.html + +.. _mrs_01_0797: + +Configuring HDFS Directory Permission +===================================== + +Scenario +-------- + +The permission for some HDFS directories is **777** or **750** by default, which brings potential security risks. You are advised to modify the permission for the HDFS directories after the HDFS is installed to increase user security. 
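
Before hardening, you can check the current permission bits of the directories in question. The following is a minimal check sketch (the directory list is an example only; it assumes the client environment variables have been loaded with **source bigdata_env** and, in security mode, that the user has been authenticated with **kinit**):

.. code-block::

   # List the directory entries themselves (-d) to show their current permission bits
   hdfs dfs -ls -d /user /mr-history /mr-history/tmp /mr-history/done /user/mapred

Directories whose permission is looser than planned can then be adjusted as described in the following procedure.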
+ +Procedure +--------- + +Log in to the HDFS client as the administrator and run the following command to modify the permission for the **/user** directory. + +The permission is set to **1777**, that is, **1** is added to the original permission. This indicates that only the user who creates the directory can delete it. + +**hdfs dfs -chmod 1777** */user* + +To ensure security of the system file, you are advised to harden the security for non-temporary directories. The following directories are examples: + +- /user:777 +- /mr-history:777 +- /mr-history/tmp:777 +- /mr-history/done:777 +- /user/mapred:755 diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_nodelabel.rst b/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_nodelabel.rst new file mode 100644 index 0000000..d4ac63f --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_hdfs_nodelabel.rst @@ -0,0 +1,240 @@ +:original_name: mrs_01_1676.html + +.. _mrs_01_1676: + +Configuring HDFS NodeLabel +========================== + +Scenario +-------- + +You need to configure the nodes for storing HDFS file data blocks based on data features. You can configure a label expression to an HDFS directory or file and assign one or more labels to a DataNode so that file data blocks can be stored on specified DataNodes. + +If the label-based data block placement policy is used for selecting DataNodes to store the specified files, the DataNode range is specified based on the label expression. Then proper nodes are selected from the specified range. + +.. note:: + + This section applies to MRS 3.\ *x* or later. + + After cross-AZ HA is enabled for a single cluster, the HDFS NodeLabel function cannot be configured. + +- Scenario 1: DataNodes partitioning scenario + + Scenario description: + + When different application data is required to run on different nodes for separate management, label expressions can be used to achieve separation of different services, storing specified services on corresponding nodes. + + By configuring the NodeLabel feature, you can perform the following operations: + + - Store data in **/HBase** to DN1, DN2, DN3, and DN4. + - Store data in **/Spark** to DN5, DN6, DN7, and DN8. + + .. _mrs_01_1676__f29094c7c7de94c108e1f8ddea541eab7: + + .. figure:: /_static/images/en-us_image_0000001295930800.png + :alt: **Figure 1** DataNode partitioning scenario + + **Figure 1** DataNode partitioning scenario + + .. note:: + + - Run the **hdfs nodelabel -setLabelExpression -expression 'LabelA[fallback=NONE]' -path /Hbase** command to set an expression for the **Hbase** directory. As shown in :ref:`Figure 1 `, the data block replicas of files in the **/Hbase** directory are placed on the nodes labeled with the **LabelA**, that is, DN1, DN2, DN3, and DN4. Similarly, run the **hdfs nodelabel -setLabelExpression -expression 'LabelB[fallback=NONE]' -path /Spark** command to set an expression for the Spark directory. Data block replicas of files in the **/Spark** directory can be placed only on nodes labeled with **LabelB**, that is, DN5, DN6, DN7, and DN8. + - For details about how to set labels for a data node, see :ref:`Configuration Description `. + - If multiple racks are available in one cluster, it is recommended that DataNodes of these racks should be available under each label, to ensure reliability of data block placement. 
+ +- Scenario 2: Specifying replica location when there are multiple racks + + Scenario description: + + In a heterogeneous cluster, customers need to allocate certain nodes with high availability to store important commercial data. Label expressions can be used to specify replica location so that the replica can be placed on a high reliable node. + + Data blocks in the **/data** directory have three replicas by default. In this case, at least one replica is stored on a node of RACK1 or RACK2 (nodes of RACK1 and RACK2 are high reliable), and the other two are stored separately on the nodes of RACK3 and RACK4. + + + .. figure:: /_static/images/en-us_image_0000001349170365.png + :alt: **Figure 2** Scenario example + + **Figure 2** Scenario example + + .. note:: + + Run the **hdfs nodelabel -setLabelExpression -expression 'LabelA||LabelB[fallback=NONE],LabelC,LabelD' -path /data** command to set an expression for the **/data** directory. + + When data is to be written to the **/data** directory, at least one data block replica is stored on a node labeled with the LabelA or LabelB, and the other two data block replicas are stored separately on the nodes labeled with the LabelC and LabelD. + +.. _mrs_01_1676__s7752fba8102e4f20ae2c86f564e2114c: + +Configuration Description +------------------------- + +- DataNode label configuration + + Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + + .. table:: **Table 1** Parameter description + + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +================================+==========================================================================================================================================================================================================================================================================================================================================================================================================+==================================================================================+ + | dfs.block.replicator.classname | Used to configure the DataNode policy of HDFS. | org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy | + | | | | + | | To enable the NodeLabel function, set this parameter to **org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeLabel**. 
| | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ + | host2tags | Used to configure a mapping between a DataNode host and a label. | ``-`` | + | | | | + | | The host name can be configured with an IP address extension expression (for example, **192.168.1.[1-128]** or **192.168.[2-3].[1-128]**) or a regular expression (for example, **/datanode-[123]/** or **/datanode-\\d{2}/**) starting and ending with a slash (/). The label configuration name cannot contain the following characters: = / \\ **Note**: The IP address must be a service IP address. | | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ + + .. note:: + + - The **host2tags** configuration item is described as follows: + + Assume there are 20 DataNodes which range from dn-1 to dn-20 in a cluster and the IP addresses of clusters range from 10.1.120.1 to 10.1.120.20. The value of **host2tags** can be represented in either of the following methods: + + **Regular expression of the host name** + + **/dn-\\d/ = label-1** indicates that the labels corresponding to dn-1 to dn-9 are label-1, that is, dn-1 = label-1, dn-2 = label-1, ..., dn-9 = label-1. + + **/dn-((1[0-9]$)|(20$))/ = label-2** indicates that the labels corresponding to dn-10 to dn-20 are label-2, that is, dn-10 = label-2, dn-11 = label-2, ...dn-20 = label-2. + + **IP address range expression** + + **10.1.120.[1-9] = label-1** indicates that the labels corresponding to 10.1.120.1 to 10.1.120.9 are label-1, that is, 10.1.120.1 = label-1, 10.1.120.2 = label-1, ..., and 10.1.120.9 = label-1. + + **10.1.120.[10-20] = label-2** indicates that the labels corresponding to 10.1.120.10 to 10.1.120.20 are label-2, that is, 10.1.120.10 = label-2, 10.1.120.11 = label-2, ..., and 10.1.120.20 = label-2. + + - Label-based data block placement policies are applicable to capacity expansion and reduction scenarios. + + A newly added DataNode will be assigned a label if the IP address of the DataNode is within the IP address range in the **host2tags** configuration item or the host name of the DataNode matches the host name regular expression in the **host2tags** configuration item. + + For example, the value of **host2tags** is **10.1.120.[1-9] = label-1**, but the current cluster has only three DataNodes: 10.1.120.1 to 10.1.120.3. If DataNode 10.1.120.4 is added for capacity expansion, the DataNode is labeled as label-1. If the 10.1.120.3 DataNode is deleted or out of the service, no data block will be allocated to the node. + +- Set label expressions for directories or files. 
+ + - On the HDFS parameter configuration page, configure **path2expression** to configure the mapping between HDFS directories and labels. If the configured HDFS directory does not exist, the configuration can succeed. When a directory with the same name as the HDFS directory is created manually, the configured label mapping relationship will be inherited by the directory within 30 minutes. After a labeled directory is deleted, a new directory with the same name as the deleted one will inherit its mapping within 30 minutes. + - For details about configuring items using commands, see the **hdfs nodelabel -setLabelExpression** command. + - To set label expressions using the Java API, invoke the **setLabelExpression(String src, String labelExpression)** method using the instantiated object NodeLabelFileSystem. *src* indicates a directory or file path on HDFS, and **labelExpression** indicates the label expression. + +- After the NodeLabel is enabled, you can run the **hdfs nodelabel -listNodeLabels** command to view the label information of each DataNode. + +Block Replica Location Selection +-------------------------------- + +Nodelabel supports different placement policies for replicas. The expression **label-1,label-2,label-3** indicates that three replicas are respectively placed in DataNodes containing label-1, label-2, and label-3. Different replica policies are separated by commas (,). + +If you want to place two replicas in DataNode with label-1, set the expression as follows: **label-1[replica=2],label-2,label-3**. In this case, if the default number of replicas is 3, two nodes with label-1 and one node with label-2 are selected. If the default number of replicas is 4, two nodes with label-1, one node with label-2, and one node with label-3 are selected. Note that the number of replicas is the same as that of each replica policy from left to right. However, the number of replicas sometimes exceeds the expressions. If the default number of replicas is 5, the extra replica is placed on the last node, that is, the node labeled with label-3. + +When the ACLs function is enabled and the user does not have the permission to access the labels used in the expression, the DataNode with the label is not selected for the replica. + +Deletion of Redundant Block Replicas +------------------------------------ + +If the number of block replicas exceeds the value of **dfs.replication** (number of file replicas specified by the user), HDFS will delete redundant block replicas to ensure cluster resource usage. + +The deletion rules are as follows: + +- Preferentially delete replicas that do not meet any expression. + + For example: The default number of file replicas is **3**. + + The label expression of **/test** is **LA[replica=1],LB[replica=1],LC[replica=1]**. + + The file replicas of **/test** are distributed on four nodes (D1 to D4), corresponding to labels (LA to LD). + + .. code-block:: + + D1:LA + D2:LB + D3:LC + D4:LD + + Then, block replicas on node D4 will be deleted. + +- If all replicas meet the expressions, delete the redundant replicas which are beyond the number specified by the expression. + + For example: The default number of file replicas is **3**. + + The label expression of **/test** is **LA[replica=1],LB[replica=1],LC[replica=1]**. + + The file replicas of **/test** are distributed on the following four nodes, corresponding to the following labels. + + .. code-block:: + + D1:LA + D2:LA + D3:LB + D4:LC + + Then, block replicas on node D1 or D2 will be deleted. 
+ +- If a file owner or group of a file owner cannot access a label, preferentially delete the replica from the DataNode mapped to the label. + +Example of label-based block placement policy +--------------------------------------------- + +Assume that there are six DataNodes, namely, dn-1, dn-2, dn-3, dn-4, dn-5, and dn-6 in a cluster and the corresponding IP address range is 10.1.120.[1-6]. Six directories must be configured with label expressions. The default number of block replicas is **3**. + +- The following provides three expressions of the DataNode label in **host2labels** file. The three expressions have the same function. + + - Regular expression of the host name + + .. code-block:: + + /dn-[1456]/ = label-1,label-2 + /dn-[26]/ = label-1,label-3 + /dn-[3456]/ = label-1,label-4 + /dn-5/ = label-5 + + - IP address range expression + + .. code-block:: + + 10.1.120.[1-6] = label-1 + 10.1.120.1 = label-2 + 10.1.120.2 = label-3 + 10.1.120.[3-6] = label-4 + 10.1.120.[4-6] = label-2 + 10.1.120.5 = label-5 + 10.1.120.6 = label-3 + + - Common host name expression + + .. code-block:: + + /dn-1/ = label-1, label-2 + /dn-2/ = label-1, label-3 + /dn-3/ = label-1, label-4 + /dn-4/ = label-1, label-2, label-4 + /dn-5/ = label-1, label-2, label-4, label-5 + /dn-6/ = label-1, label-2, label-3, label-4 + +- The label expressions of the directories are set as follows: + + .. code-block:: + + /dir1 = label-1 + /dir2 = label-1 && label-3 + /dir3 = label-2 || label-4[replica=2] + /dir4 = (label-2 || label-3) && label-4 + /dir5 = !label-1 + /sdir2.txt = label-1 && label-3[replica=3,fallback=NONE] + /dir6 = label-4[replica=2],label-2 + + .. note:: + + For details about the label expression configuration, see the **hdfs nodelabel -setLabelExpression** command. + + The file data block storage locations are as follows: + + - Data blocks of files in the **/dir1** directory can be stored on any of the following nodes: dn-1, dn-2, dn-3, dn-4, dn-5, and dn-6. + - Data blocks of files in the **/dir2** directory can be stored on the dn-2 and dn-6 nodes. The default number of block replicas is **3**. The expression matches only two DataNodes. The third replica will be stored on one of the remaining nodes in the cluster. + - Data blocks of files in the **/dir3** directory can be stored on any three of the following nodes: dn-1, dn-3, dn-4, dn-5, and dn-6. + - Data blocks of files in the **/dir4** directory can be stored on the dn-4, dn-5, and dn-6 nodes. + - Data blocks of files in the **/dir5** directory do not match any DataNode and will be stored on any three nodes in the cluster, which is the same as the default block selection policy. + - For the data blocks of the **/sdir2.txt** file, two replicas are stored on the dn-2 and dn-6 nodes. The left one is not stored in the node because **fallback=NONE** is enabled. + - Data blocks of the files in the **/dir6** directory are stored on the two nodes with label-4 selected from dn-3, dn-4, dn-5, and dn-6 and another node with label-2. If the specified number of file replicas in the **/dir6** directory is more than 3, the extra replicas will be stored on a node with label-2. + +Restrictions +------------ + +In configuration files, **key** and **value** are separated by equation signs (=), colons (:), and whitespace. Therefore, the host name of the **key** cannot contain these characters because these characters may be considered as separators. 
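
For reference, the following minimal sketch ties the preceding configuration together (the label name **label-1** and the directory **/labeldata** are illustrative only; it assumes **dfs.block.replicator.classname** has been set to the NodeLabel placement policy and that **host2tags** already maps at least one DataNode to **label-1**):

.. code-block::

   # Attach a label expression to a directory: new blocks written under it
   # are placed only on DataNodes carrying label-1 (no fallback).
   hdfs nodelabel -setLabelExpression -expression 'label-1[fallback=NONE]' -path /labeldata

   # List the labels currently attached to each DataNode to verify the mapping.
   hdfs nodelabel -listNodeLabels

If the resulting block placement does not match the expectation, compare the output of the second command with the **host2tags** configuration to confirm that the intended DataNodes actually carry the referenced label.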
diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_memory_management.rst b/doc/component-operation-guide/source/using_hdfs/configuring_memory_management.rst new file mode 100644 index 0000000..f9417fc --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_memory_management.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0791.html + +.. _mrs_01_0791: + +Configuring Memory Management +============================= + +Scenario +-------- + +In HDFS, each file object needs to register corresponding information in the NameNode and occupies certain storage space. As the number of files increases, if the original memory space cannot store the corresponding information, you need to change the memory size. + +Configuration Description +------------------------- + +**Navigation path for setting parameters:** + +Go to the **All Configurations** page of HDFS by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** Parameter description + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +=======================+========================================================================================================================================================================================================================================================+==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | GC_PROFILE | The NameNode memory size depends on the size of FsImage, which can be calculated based on the following formula: FsImage size = Number of files x 900 bytes. You can estimate the memory size of the NameNode of HDFS based on the calculation result. | custom | + | | | | + | | The value range of this parameter is as follows: | | + | | | | + | | - **high**: 4 GB | | + | | - **medium**: 2 GB | | + | | - **low**: 256 MB | | + | | - **custom**: The memory size can be set according to the data size in GC_OPTS. 
| | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | GC_OPTS | JVM parameter used for garbage collection (GC). This parameter is valid only when **GC_PROFILE** is set to **custom**. Ensure that the **GC_OPT** parameter is set correctly. Otherwise, the process will fail to be started. | -Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=2048 | + | | | | + | | .. important:: | | + | | | | + | | NOTICE: | | + | | Exercise caution when you modify the configuration. If the configuration is incorrect, the services are unavailable. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_nfs.rst b/doc/component-operation-guide/source/using_hdfs/configuring_nfs.rst new file mode 100644 index 0000000..d90b146 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_nfs.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_1665.html + +.. _mrs_01_1665: + +Configuring NFS +=============== + +Scenario +-------- + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Before deploying a cluster, you can deploy a Network File System (NFS) server based on requirements to store NameNode metadata to enhance data reliability. + +If the NFS server has been deployed and NFS services are configured, you can follow operations in this section to configure NFS on the cluster. These operations are optional. + +Procedure +--------- + +#. Check the permission of the shared NFS directories on the NFS server to ensure that the server can access NameNode in the MRS cluster. + +#. .. _mrs_01_1665__lff3e9b51a9354f89ab59a1c515495818: + + Log in to the active NameNode as user **root**. + +#. 
Run the following commands to create a directory and assign permissions to it: + + **mkdir** **${BIGDATA_DATA_HOME}/namenode-nfs** + + **chown omm:wheel** **${BIGDATA_DATA_HOME}/namenode-nfs** + + **chmod 750** **${BIGDATA_DATA_HOME}/namenode-nfs** + +#. .. _mrs_01_1665__lbb64192db9814446b3744fcbf6326d7b: + + Run the following command to mount the NFS shared directory to the active NameNode: + + **mount -t nfs -o rsize=8192,wsize=8192,soft,nolock,timeo=3,intr** *IP address of the NFS server*:*Shared directory* **${BIGDATA_DATA_HOME}/namenode-nfs** + + For example, if the IP address of the NFS server is **192.168.0.11** and the shared directory is **/opt/Hadoop/NameNode**, run the following command: + + **mount -t nfs -o rsize=8192,wsize=8192,soft,nolock,timeo=3,intr 192.168.0.11:/opt/Hadoop/NameNode** **${BIGDATA_DATA_HOME}/namenode-nfs** + +#. Perform :ref:`2 <mrs_01_1665__lff3e9b51a9354f89ab59a1c515495818>` to :ref:`4 <mrs_01_1665__lbb64192db9814446b3744fcbf6326d7b>` on the standby NameNode. + + .. note:: + + The shared directories (for example, **/opt/Hadoop/NameNode**) used on the NFS server by the active and standby NameNodes must be different. + +#. Log in to FusionInsight Manager, and choose **Cluster** > *Name of the desired cluster* > **Service** > **HDFS** > **Configuration** > **All Configurations**. + +#. In the search box, search for **dfs.namenode.name.dir**, add **${BIGDATA_DATA_HOME}/namenode-nfs** to **Value**, and click **Save**. Separate paths with commas (,). + +#. Click **OK**. On the **Dashboard** tab page, choose **More** > **Restart Service** to restart the service. diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_replica_replacement_policy_for_heterogeneous_capacity_among_datanodes.rst b/doc/component-operation-guide/source/using_hdfs/configuring_replica_replacement_policy_for_heterogeneous_capacity_among_datanodes.rst new file mode 100644 index 0000000..30c2232 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_replica_replacement_policy_for_heterogeneous_capacity_among_datanodes.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_0804.html + +.. _mrs_01_0804: + +Configuring Replica Replacement Policy for Heterogeneous Capacity Among DataNodes +================================================================================== + +Scenario +-------- + +By default, NameNode randomly selects a DataNode to write files. If the disk capacities of the DataNodes in a cluster are inconsistent (the total disk capacity of some nodes is large and that of others is small), the nodes with small disk capacity are written full first. To resolve this problem, change the default disk selection policy for data written to DataNodes to the available-space block placement policy. This policy increases the probability of writing data blocks to nodes with large available disk space, so that node usage stays balanced when the disk capacities of DataNodes are inconsistent. + +Impact on the System +-------------------- + +The disk selection policy is changed to **org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy**. Tests show that HDFS file write performance improves by about 3% after the modification. + +.. note:: + + **The default replica storage policy of the NameNode is as follows:** + + #. First replica: stored on the node where the client resides. + #. Second replica: stored on a DataNode in a remote rack. + #. Third replica: stored on a different node in the same rack as the node where the client resides. + + If there are more replicas, they are stored randomly on other DataNodes. 
+ + The replica selection mechanism (**org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy**) is as follows: + + #. First replica: stored on the DataNode where the client resides (the same as the default storage policy). + #. Second replica: + + - When selecting a storage node, two DataNodes that meet the requirements are selected. + - The disk usages of the two DataNodes are compared. If the difference is smaller than 5%, the replica is stored on the first node. + - If the difference exceeds 5%, there is a 60% probability (specified by **dfs.namenode.available-space-block-placement-policy.balanced-space-preference-fraction**, whose default value is **0.6**) that the replica is written to the node with the lower disk usage. + + #. The third and subsequent replicas are stored in the same way as the second replica. + +Prerequisites +------------- + +The total disk capacity deviation of DataNodes in the cluster cannot exceed 100%. + +Procedure +--------- + +#. Go to the **All Configurations** page of HDFS by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. Modify the disk selection policy parameter used when HDFS writes data. Search for the **dfs.block.replicator.classname** parameter and change its value to **org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy**. +#. Save the modified configuration. Restart the expired service or instance for the configuration to take effect. diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_reserved_percentage_of_disk_usage_on_datanodes.rst b/doc/component-operation-guide/source/using_hdfs/configuring_reserved_percentage_of_disk_usage_on_datanodes.rst new file mode 100644 index 0000000..a99125c --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_reserved_percentage_of_disk_usage_on_datanodes.rst @@ -0,0 +1,35 @@ +:original_name: mrs_01_1675.html + +.. _mrs_01_1675: + +Configuring Reserved Percentage of Disk Usage on DataNodes +=========================================================== + +Scenario +-------- + +When the Yarn local directory and DataNode directory are on the same disk, the disk with larger capacity can run more tasks. Therefore, more intermediate data is stored in the Yarn local directory. + +Currently, you can set **dfs.datanode.du.reserved** to configure the absolute value of the reserved disk space on DataNodes. A small value cannot meet the requirements of a disk with large capacity, whereas a large value wastes a lot of space on a disk with small capacity. + +To avoid this problem, a new parameter **dfs.datanode.du.reserved.percentage** is introduced to configure the reserved percentage of the disk space. + +.. note:: + + - If **dfs.datanode.du.reserved.percentage** and **dfs.datanode.du.reserved** are configured at the same time, the larger of the two reserved disk space values calculated using these parameters is used as the reserved space of the DataNodes. + - You are advised to set **dfs.datanode.du.reserved** or **dfs.datanode.du.reserved.percentage** based on the actual disk space. + +Configuration Description +------------------------- + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. 
table:: **Table 1** Parameter description + + +-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=====================================+======================================================================================================================================================+=======================+ + | dfs.datanode.du.reserved.percentage | Indicates the percentage of the reserved disk space on DataNodes. The DataNode permanently reserves the disk space calculated using this percentage. | 10 | + | | | | + | | The value is an integer ranging from 0 to 100. | | + +-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_the_damaged_disk_volume.rst b/doc/component-operation-guide/source/using_hdfs/configuring_the_damaged_disk_volume.rst new file mode 100644 index 0000000..01a7675 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_the_damaged_disk_volume.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1669.html + +.. _mrs_01_1669: + +Configuring the Damaged Disk Volume +=================================== + +Scenario +-------- + +In the open source version, if multiple data storage volumes are configured for a DataNode, the DataNode stops providing services by default if one of the volumes is damaged. You can change the value of **dfs.datanode.failed.volumes.tolerated** to specify the number of damaged disk volumes that are allowed. If the number of damaged volumes does not exceed the threshold, DataNode continues to provide services. + +Configuration Description +------------------------- + +**Navigation path for setting parameters:** + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** Parameter description + + +---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+ + | Parameter | Description | Default Value | + +=======================================+==============================================================================================================================================================================================================================================================================================================================================+==================================+ + | dfs.datanode.failed.volumes.tolerated | Specifies the number of damaged volumes that are allowed before the DataNode stops providing services. By default, there must be at least one valid volume. The value **-1** indicates that the minimum value of a valid volume is **1**. The value greater than or equal to **0** indicates the number of damaged volumes that are allowed. 
| Versions earlier than MRS 3.x: 0 | + | | | | + | | | MRS 3.\ *x* or later: -1 | + +---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_the_namenode_blacklist.rst b/doc/component-operation-guide/source/using_hdfs/configuring_the_namenode_blacklist.rst new file mode 100644 index 0000000..43c0237 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_the_namenode_blacklist.rst @@ -0,0 +1,58 @@ +:original_name: mrs_01_1670.html + +.. _mrs_01_1670: + +Configuring the NameNode Blacklist +================================== + +Scenario +-------- + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +In the existing default DFSclient failover proxy provider, if a NameNode in a process is faulty, all HDFS client instances in the same process attempt to connect to the NameNode again. As a result, the application waits for a long time and timeout occurs. + +When clients in the same JVM process connect to the NameNode that cannot be accessed, the system is overloaded. The NameNode blacklist is equipped with the MRS cluster to avoid this problem. + +In the new Blacklisting DFSClient failover provider, the faulty NameNode is recorded in a list. The DFSClient then uses the information to prevent the client from connecting to such NameNodes again. This function is called NameNode blacklisting. + +For example, there is a cluster with the following configurations: + +namenode: nn1, nn2 + +dfs.client.failover.connection.retries: 20 + +Processes in a single JVM: 10 clients + +In the preceding cluster, if the active **nn1** cannot be accessed, client1 will retry the connection for 20 times. Then, a failover occurs, and client1 will connect to **nn2**. In the same way, other clients also connect to **nn2** when the failover occurs after retrying the connection to **nn1** for 20 times. Such process prolongs the fault recovery of NameNode. + +In this case, the NameNode blacklisting adds **nn1** to the blacklist when client1 attempts to connect to the active **nn1** which is already faulty. Therefore, other clients will avoid trying to connect to **nn1** but choose **nn2** directly. + +.. note:: + + If, at any time, all NameNodes are added to the blacklist, the content in the blacklist will be cleared, and the client attempts to connect to the NameNodes based on the initial NameNode list. If any fault occurs again, the NameNode is still added to the blacklist. + + +.. figure:: /_static/images/en-us_image_0000001296090668.jpg + :alt: **Figure 1** NameNode blacklisting working principle + + **Figure 1** NameNode blacklisting working principle + +Configuration Description +------------------------- + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. 
table:: **Table 1** NameNode blacklisting parameters + + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +=========================================================+=========================================================================================================+=========================================================================+ + | dfs.client.failover.proxy.provider.\ *[nameservice ID]* | Client Failover proxy provider class which creates the NameNode proxy using the authenticated protocol. | org.apache.hadoop.hdfs.server.namenode.ha.AdaptiveFailoverProxyProvider | + | | | | + | | Set this parameter to **org.apache.hadoop.hdfs.server.namenode.ha.BlackListingFailoverProxyProvider**. | | + | | | | + | | You can configure the observer NameNode to process read requests. | | + +---------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_the_number_of_files_in_a_single_hdfs_directory.rst b/doc/component-operation-guide/source/using_hdfs/configuring_the_number_of_files_in_a_single_hdfs_directory.rst new file mode 100644 index 0000000..e04cce9 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_the_number_of_files_in_a_single_hdfs_directory.rst @@ -0,0 +1,35 @@ +:original_name: mrs_01_0805.html + +.. _mrs_01_0805: + +Configuring the Number of Files in a Single HDFS Directory +========================================================== + +Scenario +-------- + +Generally, multiple services are deployed in a cluster, and the storage of most services depends on the HDFS file system. Different components such as Spark and Yarn or clients are constantly writing files to the same HDFS directory when the cluster is running. However, the number of files in a single directory in HDFS is limited. Users must plan to prevent excessive files in a single directory and task failure. + +You can set the number of files in a single directory using the **dfs.namenode.fs-limits.max-directory-items** parameter in HDFS. + +Procedure +--------- + +#. Go to the **All Configurations** page of HDFS by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. Search for the configuration item **dfs.namenode.fs-limits.max-directory-items**. + + .. table:: **Table 1** Parameter description + + +--------------------------------------------+----------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +============================================+========================================+=======================+ + | dfs.namenode.fs-limits.max-directory-items | Maximum number of items in a directory | 1048576 | + | | | | + | | Value range: 1 to 6,400,000 | | + +--------------------------------------------+----------------------------------------+-----------------------+ + +#. Set the maximum number of files that can be stored in a single HDFS directory. Save the modified configuration. Restart the expired service or instance for the configuration to take effect. + + .. 
note:: + + Plan data storage in advance based on time and service type categories to prevent excessive files in a single directory. You are advised to use the default value, which is about 1 million pieces of data in a single directory. diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_the_observer_namenode_to_process_read_requests.rst b/doc/component-operation-guide/source/using_hdfs/configuring_the_observer_namenode_to_process_read_requests.rst new file mode 100644 index 0000000..294200e --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_the_observer_namenode_to_process_read_requests.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_1681.html + +.. _mrs_01_1681: + +Configuring the Observer NameNode to Process Read Requests +========================================================== + +Scenario +-------- + +In an HDFS cluster configured with HA, the active NameNode processes all client requests, and the standby NameNode reserves the latest metadata and block location information. However, in this architecture, the active NameNode is the bottleneck of client request processing. This bottleneck is more obvious in clusters with a large number of requests. + +To address this issue, a new NameNode is introduced: an observer NameNode. Similar to the standby NameNode, the observer NameNode also reserves the latest metadata information and block location information. In addition, the observer NameNode can process read requests from clients in the same way as the active NameNode. In typical HDFS clusters with many read requests, the observer NameNode can be used to process read requests, reducing the active NameNode load and improving the cluster capability of processing requests. + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Impact on the System +-------------------- + +- The active NameNode load can be reduced and the capability of HDFS cluster processing requests can be improved, which is especially obvious for large clusters. +- The client application configuration needs to be updated. + +Prerequisites +------------- + +- The HDFS cluster has been installed, the active and standby NameNodes are running properly, and the HDFS service is normal. +- The **${BIGDATA_DATA_HOME}/namenode** partition has been created on the node where the observer NameNode is to be installed. + +Procedure +--------- + +The following steps describe how to configure the observer NameNode of a hacluster and enable it to process read requests. If there are multiple pairs of NameServices in the cluster and they are all in use, perform the following steps to configure the observer NameNode for each pair. + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **NameService Management**. +#. Click **Add** next to **hacluster**. +#. On the **Add NameNode** page, set **NameNode type** to **Observer** and click **Next**. +#. On the **Assign Role** page, select the planned host, add the observer NameNode, and click **Next**. + + .. note:: + + A maximum of five observer NameNodes can be added to each pair of NameServices. + +#. On the configuration page, configure the storage directory and port number of the NameNode as planned and click **Next**. +#. Confirm the information, click **Submit**, and wait until the installation of the observer NameNode is complete. + +8. 
Restart the upper-layer components that depend on HDFS, update the client application configuration, and restart the client application. diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_the_recycle_bin_mechanism.rst b/doc/component-operation-guide/source/using_hdfs/configuring_the_recycle_bin_mechanism.rst new file mode 100644 index 0000000..98de135 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_the_recycle_bin_mechanism.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0806.html + +.. _mrs_01_0806: + +Configuring the Recycle Bin Mechanism +===================================== + +Scenario +-------- + +On HDFS, deleted files are moved to the recycle bin (trash can) so that the data deleted by mistake can be restored. + +You can set the time threshold for storing files in the recycle bin. Once the file storage duration exceeds the threshold, it is permanently deleted from the recycle bin. If the recycle bin is cleared, all files in the recycle bin are permanently deleted. + +Configuration Description +------------------------- + +If a file is deleted from HDFS, the file is saved in the trash space rather than cleared immediately. After the aging time is due, the deleted file becomes an aging file and will be cleared based on the system mechanism or manually cleared by users. + +**Parameter portal:** + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** Parameter description + + +------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +==============================+===================================================================================================================================================================================================================================================================================================================================================================================================================================================================+=======================+ + | fs.trash.interval | Trash collection time, in minutes. If data in the trash station exceeds the time, the data will be deleted. Value range: 1440 to 259200 | 1440 | + +------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | fs.trash.checkpoint.interval | Interval between trash checkpoints, in minutes. The value must be less than or equal to the value of **fs.trash.interval**. 
The checkpoint program creates a checkpoint every time it runs and removes the checkpoint created **fs.trash.interval** minutes ago. For example, the system checks whether aging files exist every 10 minutes and deletes aging files if any. Files that are not aging are stored in the checkpoint list waiting for the next check. | 60 | + | | | | + | | If this parameter is set to 0, the system does not check aging files and all aging files are saved in the system. | | + | | | | + | | Value range: 0 to *fs.trash.interval* | | + | | | | + | | .. note:: | | + | | | | + | | It is not recommended to set this parameter to 0 because aging files will use up the disk space of the cluster. | | + +------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/configuring_ulimit_for_hbase_and_hdfs.rst b/doc/component-operation-guide/source/using_hdfs/configuring_ulimit_for_hbase_and_hdfs.rst new file mode 100644 index 0000000..e661b0d --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/configuring_ulimit_for_hbase_and_hdfs.rst @@ -0,0 +1,53 @@ +:original_name: mrs_01_0801.html + +.. _mrs_01_0801: + +Configuring ulimit for HBase and HDFS +===================================== + +Symptom +------- + +When you open an HDFS file, an error occurs due to the limit on the number of file handles. Information similar to the following is displayed. + +.. code-block:: + + IOException (Too many open files) + +Procedure +--------- + +You can contact the system administrator to add file handles for each user. This is a configuration on the OS instead of HBase or HDFS. It is recommended that the system administrator configure the number of file handles based on the service traffic of HBase and HDFS and the rights of each user. If a user performs a large number of operations frequently on the HDFS that has large service traffic, set the number of file handles of this user to a large value. + +#. Log in to the OSs of all nodes or clients in the cluster as user **root**, and go to the **/etc/security** directory. + +#. Run the following command to edit the **limits.conf** file: + + **vi limits.conf** + + Add the following information to the file. + + .. code-block:: + + hdfs - nofile 32768 + hbase - nofile 32768 + + **hdfs** and **hbase** indicate the usernames of the OSs that are used during the services. + + .. note:: + + - Only user **root** has the rights to edit the **limits.conf** file. + - If this modification does not take effect, check whether other nofile values exist in the **/etc/security/limits.d** directory. Such values may overwrite the values set in the **/etc/security/limits.conf** file. + - If a user needs to perform operations on HBase, set the number of file handles of this user to a value greater than **10000**. If a user needs to perform operations on HDFS, set the number of file handles of this user based on the service traffic. It is recommended that the value not be too small. 
If a user needs to perform operations on both HBase and HDFS, set the number of file handles of this user to a large value, such as **32768**. + +#. Run the following command to check the limit on the number of file handles of a user: + + **su -** *user_name* + + **ulimit -n** + + The limit on the number of file handles of this user is displayed as follows. + + .. code-block:: + + 8194 diff --git a/doc/component-operation-guide/source/using_hdfs/creating_an_hdfs_role.rst b/doc/component-operation-guide/source/using_hdfs/creating_an_hdfs_role.rst new file mode 100644 index 0000000..1e4ad1e --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/creating_an_hdfs_role.rst @@ -0,0 +1,87 @@ +:original_name: mrs_01_1662.html + +.. _mrs_01_1662: + +Creating an HDFS Role +===================== + +Scenario +-------- + +This section describes how to create and configure an HDFS role on FusionInsight Manager. The HDFS role is granted the rights to read, write, and execute HDFS directories or files. + +A user has the complete permission on the created HDFS directories or files, that is, the user can directly read data from and write data to as well as authorize others to access the HDFS directories or files. + +.. note:: + + - This section applies to MRS 3.\ *x* or later clusters. + - An HDFS role can be created only in security mode. + - If the current component uses Ranger for permission control, HDFS policies must be configured based on Ranger for permission management. For details, see :ref:`Adding a Ranger Access Permission Policy for HDFS `. + +Prerequisites +------------- + +The system administrator has understood the service requirements. + +Procedure +--------- + +#. Log in to FusionInsight Manager, and choose **System** > **Permission** > **Role**. + +#. On the displayed page, click **Create Role** and fill in **Role Name** and **Description**. + +#. Configure the resource permission. For details, see :ref:`Table 1 `. + + **File System**: HDFS directory and file permission + + Common HDFS directories are as follows: + + - **flume**: Flume data storage directory + + - **hbase**: HBase data storage directory + + - **mr-history**: MapReduce task information storage directory + + - **tmp**: temporary data storage directory + + - **user**: user data storage directory + + .. _mrs_01_1662__tc5a4f557e6144488a1ace112bb8db6ee: + + .. table:: **Table 1** Setting a role + + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Task | Operation | + +===================================================================================================================+======================================================================================================================================+ + | Setting the HDFS administrator permission | In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS, and select **Cluster Admin Operations**. | + | | | + | | .. note:: | + | | | + | | The setting takes effect after the HDFS service is restarted. 
| + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to check and recover HDFS | a. In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the save path of specified directories or files on HDFS. | + | | c. In the **Permission** column of the specified directories or files, select **Read** and **Execute**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to read directories or files of other users | a. In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the save path of specified directories or files on HDFS. | + | | c. In the **Permission** column of the specified directories or files, select **Read** and **Execute**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to write data to files of other users | a. In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the save path of specified files on HDFS. | + | | c. In the **Permission** column of the specified files, select **Write** and **Execute**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to create or delete sub-files or sub-directories in the directory of other users | a. In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the path where the specified directory is saved in the HDFS. | + | | c. In the **Permission** column of the specified directories, select **Write** and **Execute**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for users to execute directories or files of other users | a. In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the save path of specified directories or files on HDFS. | + | | c. In the **Permission** column of the specified directories or files, select **Execute**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + | Setting the permission for allowing subdirectories to inherit all permissions of their parent directories | a. 
In the **Configure Resource Permission** area, choose *Name of the desired cluster* > HDFS > **File System**. | + | | b. Locate the save path of specified directories or files on HDFS. | + | | c. In the **Permission** column of the specified directories or files, select **Recursive**. | + +-------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+ + +#. Click **OK**, and return to the **Role** page. diff --git a/doc/component-operation-guide/source/using_hdfs/faq/blocks_miss_on_the_namenode_ui_after_the_successful_rollback.rst b/doc/component-operation-guide/source/using_hdfs/faq/blocks_miss_on_the_namenode_ui_after_the_successful_rollback.rst new file mode 100644 index 0000000..96ec9ec --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/blocks_miss_on_the_namenode_ui_after_the_successful_rollback.rst @@ -0,0 +1,62 @@ +:original_name: mrs_01_1704.html + +.. _mrs_01_1704: + +Blocks Miss on the NameNode UI After the Successful Rollback +============================================================ + +Question +-------- + +Why are some blocks missing on the NameNode UI after the rollback is successful? + +Answer +------ + +This problem occurs because blocks with new IDs or genstamps may exist on the DataNode. The block files in the DataNode may have different generation flags and lengths from those in the rollback images of the NameNode. Therefore, the NameNode rejects these blocks in the DataNode and marks the files as damaged. + +**Scenarios:** + +#. Before an upgrade: + + Client A writes some data to file X. (Assume A bytes are written.) + +2. During an upgrade: + + Client A still writes data to file X. (The data in the file is A + B bytes.) + +3. After an upgrade: + + Client A completes the file writing. The final data is A + B bytes. + +4. Rollback started: + + The status will be rolled back to the status before the upgrade. That is, file X in NameNode will have A bytes, but block files in DataNode will have A + B bytes. + +**Recovery procedure:** + +#. Obtain the list of damaged files from NameNode web UI or run the following command to obtain: + + **hdfs fsck -list-corruptfileblocks** + +#. Run the following command to delete unnecessary files: + + **hdfs fsck - delete** + + .. note:: + + Deleting a file is a high-risk operation. Ensure that the files are no longer needed before performing this operation. + +#. For the required files, run the **fsck** command to obtain the block list and block sequence. + + - In the block sequence table provided, use the block ID to search for the data directory in the DataNode and download the corresponding block from the DataNode. + + - Write all such block files in appending mode based on the sequence to construct the original file. + + Example: + + File 1--> blk_1, blk_2, blk_3 + + Create a file by combining the contents of all three block files from the same sequence. + + - Delete the old file from HDFS and rewrite the new file. 
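+ +The following is a minimal sketch of the recovery procedure above for a single file. The HDFS path **/user/test/file1** and the local working copies of the block files are illustrative assumptions; the block names **blk_1**, **blk_2**, and **blk_3** follow the example above, and only standard **hdfs fsck**, **hdfs dfs**, and shell commands are used. + +.. code-block:: + + # List the blocks and their locations for the file to be rebuilt. + hdfs fsck /user/test/file1 -files -blocks -locations + + # Copy each block file from the DataNode data directories to a local working + # directory, then append them in block-sequence order to rebuild the file. + cat blk_1 blk_2 blk_3 > file1.rebuilt + + # Replace the damaged file in HDFS with the rebuilt copy. + hdfs dfs -rm /user/test/file1 + hdfs dfs -put file1.rebuilt /user/test/file1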
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/can_i_delete_or_modify_the_data_storage_directory_in_datanode.rst b/doc/component-operation-guide/source/using_hdfs/faq/can_i_delete_or_modify_the_data_storage_directory_in_datanode.rst new file mode 100644 index 0000000..1ac57c2 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/can_i_delete_or_modify_the_data_storage_directory_in_datanode.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1703.html + +.. _mrs_01_1703: + +Can I Delete or Modify the Data Storage Directory in DataNode? +============================================================== + +Question +-------- + +- In DataNode, the storage directory of data blocks is specified by **dfs.datanode.data.dir**\ **.** Can I modify **dfs.datanode.data.dir** to modify the data storage directory? +- Can I modify files under the data storage directory? + +Answer +------ + +During the system installation, you need to configure the **dfs.datanode.data.dir** parameter to specify one or more root directories. + +- During the system installation, you need to configure the dfs.datanode.data.dir parameter to specify one or more root directories. + +- Exercise caution when modifying dfs.datanode.data.dir. You can configure this parameter to add a new data root directory. +- Do not modify or delete data blocks in the storage directory. Otherwise, the data blocks will lose. + +.. note:: + + Similarly, do not delete the storage directory, or modify or delete data blocks under the directory using the following parameters: + + - dfs.namenode.edits.dir + - dfs.namenode.name.dir + - dfs.journalnode.edits.dir diff --git a/doc/component-operation-guide/source/using_hdfs/faq/datanode_is_normal_but_cannot_report_data_blocks.rst b/doc/component-operation-guide/source/using_hdfs/faq/datanode_is_normal_but_cannot_report_data_blocks.rst new file mode 100644 index 0000000..38e4faa --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/datanode_is_normal_but_cannot_report_data_blocks.rst @@ -0,0 +1,62 @@ +:original_name: mrs_01_1693.html + +.. _mrs_01_1693: + +DataNode Is Normal but Cannot Report Data Blocks +================================================ + +Question +-------- + +The DataNode is normal, but cannot report data blocks. As a result, the existing data blocks cannot be used. + +Answer +------ + +This error may occur when the number of data blocks in a data directory exceeds four times the upper limit (4 x 1 MB). And the DataNode generates the following error logs: + +.. code-block:: + + 2015-11-05 10:26:32,936 | ERROR | DataNode:[[[DISK]file:/srv/BigData/hadoop/data1/dn/]] heartbeating to + vm-210/10.91.8.210:8020 | Exception in BPOfferService for Block pool BP-805114975-10.91.8.210-1446519981645 + (Datanode Uuid bcada350-0231-413b-bac0-8c65e906c1bb) service to vm-210/10.91.8.210:8020 | BPServiceActor.java:824 + java.lang.IllegalStateException:com.google.protobuf.InvalidProtocolBufferException:Protocol message was too large.May + be malicious.Use CodedInputStream.setSizeLimit() to increase the size limit. at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:369) + at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:347) at org.apache.hadoop.hdfs. + protocol.BlockListAsLongs$BufferDecoder.getBlockListAsLongs(BlockListAsLongs.java:325) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB. 
+ blockReport(DatanodeProtocolClientSideTranslatorPB.java:190) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:473) + at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:685) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:822) + at java.lang.Thread.run(Thread.java:745) Caused by:com.google.protobuf.InvalidProtocolBufferException:Protocol message was too large.May be malicious.Use CodedInputStream.setSizeLimit() + to increase the size limit. at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) + at com.google.protobuf.CodedInputStream.readRawByte(CodedInputStream.java:769) at com.google.protobuf.CodedInputStream.readRawVarint64(CodedInputStream.java:462) at com.google.protobuf. + CodedInputStream.readSInt64(CodedInputStream.java:363) at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:363) + +The number of data blocks in the data directory is displayed as **Metric**. You can monitor its value through **http://:/jmx**. If the value is greater than four times the upper limit (4 x 1 MB), you are advised to configure multiple drives and restart HDFS. + +**Recovery procedure:** + +#. Configure multiple data directories on the DataNode. + + For example, configure multiple directories on the DataNode where only the **/data1/datadir** directory is configured: + + .. code-block:: + + dfs.datanode.data.dir /data1/datadir + + Configure as follows: + + .. code-block:: + + dfs.datanode.data.dir /data1/datadir/,/data2/datadir,/data3/datadir + + .. note:: + + You are advised to configure multiple data directories on multiple disks. Otherwise, performance may be affected. + +#. Restart the HDFS. + +#. Perform the following operation to move the data to the new data directory: + + **mv** */data1/datadir/current/finalized/subdir1 /data2/datadir/current/finalized/subdir1* + +#. Restart the HDFS. diff --git a/doc/component-operation-guide/source/using_hdfs/faq/failed_to_calculate_the_capacity_of_a_datanode_when_multiple_data.dir_directories_are_configured_in_a_disk_partition.rst b/doc/component-operation-guide/source/using_hdfs/faq/failed_to_calculate_the_capacity_of_a_datanode_when_multiple_data.dir_directories_are_configured_in_a_disk_partition.rst new file mode 100644 index 0000000..161716f --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/failed_to_calculate_the_capacity_of_a_datanode_when_multiple_data.dir_directories_are_configured_in_a_disk_partition.rst @@ -0,0 +1,49 @@ +:original_name: mrs_01_1697.html + +.. _mrs_01_1697: + +Failed to Calculate the Capacity of a DataNode when Multiple data.dir Directories Are Configured in a Disk Partition +==================================================================================================================== + +Question +-------- + +The capacity of a DataNode fails to calculate when multiple data.dir directories are configured in a disk partition. + +Answer +------ + +Currently, the capacity is calculated based on disks, which is similar to the **df** command in Linux. Ideally, users do not configure multiple data.dir directories in a disk partition. Otherwise, all data will be written to the same disk, greatly deteriorating the performance. + +You are advised to configure them as below. 
+ +For example, if a node contains the following disks: + +.. code-block:: + + host-4:~ # df -h + Filesystem Size Used Avail Use% Mounted on + /dev/sda1 352G 11G 324G 4% / + udev 190G 252K 190G 1% /dev + tmpfs 190G 72K 190G 1% /dev/shm + /dev/sdb1 2.7T 74G 2.5T 3% /data1 + /dev/sdc1 2.7T 75G 2.5T 3% /data2 + /dev/sdd1 2.7T 73G 2.5T 3% /da + +Recommended configuration: + +.. code-block:: + + + dfs.datanode.data.dir + /data1/datadir/,/data2/datadir,/data3/datadir + + +Unrecommended configuration: + +.. code-block:: + + + dfs.datanode.data.dir + /data1/datadir1/,/data2/datadir1,/data3/datadir1,/data1/datadir2,data1/datadir3,/data2/datadir2,/data2/datadir3,/data3/datadir2,/data3/datadir3 + diff --git a/doc/component-operation-guide/source/using_hdfs/faq/hdfs_webui_cannot_properly_update_information_about_damaged_data.rst b/doc/component-operation-guide/source/using_hdfs/faq/hdfs_webui_cannot_properly_update_information_about_damaged_data.rst new file mode 100644 index 0000000..dfb9e56 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/hdfs_webui_cannot_properly_update_information_about_damaged_data.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1694.html + +.. _mrs_01_1694: + +HDFS WebUI Cannot Properly Update Information About Damaged Data +================================================================ + +Question +-------- + +#. When errors occur in the **dfs.datanode.data.dir** directory of DataNode due to the permission or disk damage, HDFS WebUI does not display information about damaged data. +#. After errors are restored, HDFS WebUI does not timely remove related information about damaged data. + +Answer +------ + +#. DataNode checks whether the disk is normal only when errors occur in file operations. Therefore, only when a data damage is detected and the error is reported to NameNode, NameNode displays information about the damaged data on HDFS WebUI. +#. After errors are fixed, you need to restart DataNode. During restarting DataNode, all data states are checked and damaged data information is uploaded to NameNode. Therefore, after errors are fixed, damaged data information is not displayed on the HDFS WebUI only by restarting DataNode. diff --git a/doc/component-operation-guide/source/using_hdfs/faq/index.rst b/doc/component-operation-guide/source/using_hdfs/faq/index.rst new file mode 100644 index 0000000..004dca2 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/index.rst @@ -0,0 +1,48 @@ +:original_name: mrs_01_1690.html + +.. _mrs_01_1690: + +FAQ +=== + +- :ref:`NameNode Startup Is Slow ` +- :ref:`DataNode Is Normal but Cannot Report Data Blocks ` +- :ref:`HDFS WebUI Cannot Properly Update Information About Damaged Data ` +- :ref:`Why Does the Distcp Command Fail in the Secure Cluster, Causing an Exception? ` +- :ref:`Why Does DataNode Fail to Start When the Number of Disks Specified by dfs.datanode.data.dir Equals dfs.datanode.failed.volumes.tolerated? ` +- :ref:`Failed to Calculate the Capacity of a DataNode when Multiple data.dir Directories Are Configured in a Disk Partition ` +- :ref:`Standby NameNode Fails to Be Restarted When the System Is Powered off During Metadata (Namespace) Storage ` +- :ref:`Why Data in the Buffer Is Lost If a Power Outage Occurs During Storage of Small Files ` +- :ref:`Why Does Array Border-crossing Occur During FileInputFormat Split? ` +- :ref:`Why Is the Storage Type of File Copies DISK When the Tiered Storage Policy Is LAZY_PERSIST? 
` +- :ref:`The HDFS Client Is Unresponsive When the NameNode Is Overloaded for a Long Time ` +- :ref:`Can I Delete or Modify the Data Storage Directory in DataNode? ` +- :ref:`Blocks Miss on the NameNode UI After the Successful Rollback ` +- :ref:`Why Is "java.net.SocketException: No buffer space available" Reported When Data Is Written to HDFS ` +- :ref:`Why are There Two Standby NameNodes After the active NameNode Is Restarted? ` +- :ref:`When Does a Balance Process in HDFS, Shut Down and Fail to be Executed Again? ` +- :ref:`"This page can't be displayed" Is Displayed When Internet Explorer Fails to Access the Native HDFS UI ` +- :ref:`NameNode Fails to Be Restarted Due to EditLog Discontinuity ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + namenode_startup_is_slow + datanode_is_normal_but_cannot_report_data_blocks + hdfs_webui_cannot_properly_update_information_about_damaged_data + why_does_the_distcp_command_fail_in_the_secure_cluster,_causing_an_exception + why_does_datanode_fail_to_start_when_the_number_of_disks_specified_by_dfs.datanode.data.dir_equals_dfs.datanode.failed.volumes.tolerated + failed_to_calculate_the_capacity_of_a_datanode_when_multiple_data.dir_directories_are_configured_in_a_disk_partition + standby_namenode_fails_to_be_restarted_when_the_system_is_powered_off_during_metadata_namespace_storage + why_data_in_the_buffer_is_lost_if_a_power_outage_occurs_during_storage_of_small_files + why_does_array_border-crossing_occur_during_fileinputformat_split + why_is_the_storage_type_of_file_copies_disk_when_the_tiered_storage_policy_is_lazy_persist + the_hdfs_client_is_unresponsive_when_the_namenode_is_overloaded_for_a_long_time + can_i_delete_or_modify_the_data_storage_directory_in_datanode + blocks_miss_on_the_namenode_ui_after_the_successful_rollback + why_is_java.net.socketexception_no_buffer_space_available_reported_when_data_is_written_to_hdfs + why_are_there_two_standby_namenodes_after_the_active_namenode_is_restarted + when_does_a_balance_process_in_hdfs,_shut_down_and_fail_to_be_executed_again + this_page_cant_be_displayed_is_displayed_when_internet_explorer_fails_to_access_the_native_hdfs_ui + namenode_fails_to_be_restarted_due_to_editlog_discontinuity diff --git a/doc/component-operation-guide/source/using_hdfs/faq/namenode_fails_to_be_restarted_due_to_editlog_discontinuity.rst b/doc/component-operation-guide/source/using_hdfs/faq/namenode_fails_to_be_restarted_due_to_editlog_discontinuity.rst new file mode 100644 index 0000000..a4a288b --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/namenode_fails_to_be_restarted_due_to_editlog_discontinuity.rst @@ -0,0 +1,39 @@ +:original_name: mrs_01_1709.html + +.. _mrs_01_1709: + +NameNode Fails to Be Restarted Due to EditLog Discontinuity +=========================================================== + +Question +-------- + +If a JournalNode server is powered off, the data directory disk is fully occupied, and the network is abnormal, the EditLog sequence number on the JournalNode is inconsecutive. In this case, the NameNode restart may fail. + +Symptom +------- + +The NameNode fails to be restarted. The following error information is reported in the NameNode run logs: + +|image1| + +Solution +-------- + +#. 
Find the active NameNode before the restart, go to its data directory (you can obtain the directory, such as **/srv/BigData/namenode/current** by checking the configuration item **dfs.namenode.name.dir**), and obtain the sequence number of the latest FsImage file, as shown in the following figure: + + |image2| + +#. Check the data directory of each JournalNode (you can obtain the directory such as\ **/srv/BigData/journalnode/hacluster/current** by checking the value of the configuration item **dfs.journalnode.edits.dir**), and check whether the sequence number starting from that obtained in step 1 is consecutive in edits files. That is, you need to check whether the last sequence number of the previous edits file is consecutive with the first sequence number of the next edits file. (As shown in the following figure, edits_0000000000013259231-0000000000013259237 and edits_0000000000013259239-0000000000013259246 are not consecutive.) + + |image3| + +#. If the edits files are not consecutive, check whether the edits files with the related sequence number exist in the data directories of other JournalNodes or NameNode. If the edits files can be found, copy a consecutive segment to the JournalNode. + +#. In this way, all inconsecutive edits files are restored. + +#. Restart the NameNode and check whether the restart is successful. If the fault persists, contact technical support. + +.. |image1| image:: /_static/images/en-us_image_0000001349289573.png +.. |image2| image:: /_static/images/en-us_image_0000001348770293.png +.. |image3| image:: /_static/images/en-us_image_0000001295930432.png diff --git a/doc/component-operation-guide/source/using_hdfs/faq/namenode_startup_is_slow.rst b/doc/component-operation-guide/source/using_hdfs/faq/namenode_startup_is_slow.rst new file mode 100644 index 0000000..7f50f0d --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/namenode_startup_is_slow.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_1691.html + +.. _mrs_01_1691: + +NameNode Startup Is Slow +======================== + +Question +-------- + +The NameNode startup is slow when it is restarted immediately after a large number of files (for example, 1 million files) are deleted. + +Answer +------ + +It takes time for the DataNode to delete the corresponding blocks after files are deleted. When the NameNode is restarted immediately, it checks the block information reported by all DataNodes. If a deleted block is found, the NameNode generates the corresponding INFO log information, as shown below: + +.. code-block:: + + 2015-06-10 19:25:50,215 | INFO | IPC Server handler 36 on 25000 | BLOCK* processReport: + blk_1075861877_2121067 on node 10.91.8.218:9866 size 10249 does not belong to any file | + org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1854) + +A log is generated for each deleted block. A file may contain one or more blocks. Therefore, after startup, the NameNode spends a large amount of time printing logs when a large number of files are deleted. As a result, the NameNode startup becomes slow. + +To address this issue, the following operations can be performed to speed up the startup: + +#. After a large number of files are deleted, wait until the DataNode deletes the corresponding blocks and then restart the NameNode. + + You can run the **hdfs dfsadmin -report** command to check the disk space and check whether the files have been deleted. + +#. 
If a large number of the preceding logs are generated, you can change the NameNode log level to **ERROR** so that the NameNode stops printing such logs.
+
+   After the NameNode is restarted, change the log level back to **INFO**. You do not need to restart the service after changing the log level.
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/standby_namenode_fails_to_be_restarted_when_the_system_is_powered_off_during_metadata_namespace_storage.rst b/doc/component-operation-guide/source/using_hdfs/faq/standby_namenode_fails_to_be_restarted_when_the_system_is_powered_off_during_metadata_namespace_storage.rst
new file mode 100644
index 0000000..abb487d
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/standby_namenode_fails_to_be_restarted_when_the_system_is_powered_off_during_metadata_namespace_storage.rst
@@ -0,0 +1,28 @@
+:original_name: mrs_01_1698.html
+
+.. _mrs_01_1698:
+
+Standby NameNode Fails to Be Restarted When the System Is Powered off During Metadata (Namespace) Storage
+==========================================================================================================
+
+Question
+--------
+
+When the standby NameNode is powered off while metadata (namespace) is being stored, it fails to start and the following error information is displayed.
+
+|image1|
+
+Answer
+------
+
+A power-off during metadata (namespace) storage damages the MD5 file of the fsimage, so the standby NameNode cannot be started. To rectify the fault, remove the damaged fsimage and then start the standby NameNode. After the rectification, the standby NameNode loads the previous fsimage and replays all edits.
+
+Recovery procedure:
+
+#. Run the following command to remove the damaged fsimage:
+
+   **rm -rf ${BIGDATA_DATA_HOME}/namenode/current/fsimage_0000000000000096**
+
+#. Start the standby NameNode.
+
+.. |image1| image:: /_static/images/en-us_image_0000001349169981.png
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/the_hdfs_client_is_unresponsive_when_the_namenode_is_overloaded_for_a_long_time.rst b/doc/component-operation-guide/source/using_hdfs/faq/the_hdfs_client_is_unresponsive_when_the_namenode_is_overloaded_for_a_long_time.rst
new file mode 100644
index 0000000..54a59ed
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/the_hdfs_client_is_unresponsive_when_the_namenode_is_overloaded_for_a_long_time.rst
@@ -0,0 +1,45 @@
+:original_name: mrs_01_1702.html
+
+.. _mrs_01_1702:
+
+The HDFS Client Is Unresponsive When the NameNode Is Overloaded for a Long Time
+===============================================================================
+
+Question
+--------
+
+When the NameNode is overloaded (100% of the CPU is occupied), it stops responding. The HDFS clients that are already connected to the overloaded NameNode fail to run properly, whereas HDFS clients that connect afterwards are switched to the backup NameNode and run properly.
+
+Answer
+------
+
+When the preceding problem occurs and the default configuration (described in :ref:`Table 1 <mrs_01_1702__tf99cac42ab7947b3bffe186b74e79d38>`) is used, the **keep alive** mechanism is enabled for the RPC connection between the HDFS client and the NameNode. The **keep alive** mechanism keeps the HDFS client waiting for a response from the server and prevents the connection from timing out, so the HDFS client appears unresponsive.
+
+Perform either of the following operations on the unresponsive HDFS client:
+
+- Leave the HDFS client waiting.
Once the CPU usage of the node where NameNode locates drops, the NameNode will obtain CPU resources and the HDFS client will receive a response. +- If you do not want to leave the HDFS client running, restart the application where the HDFS client locates to reconnect the HDFS client to another idle NameNode. + +Procedure: + +Configure the following parameters in the **c**\ **ore-site.xml** file on the client. + +.. _mrs_01_1702__tf99cac42ab7947b3bffe186b74e79d38: + +.. table:: **Table 1** Parameter description + + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=======================+==========================================================================================================================================================================================================================+=======================+ + | ipc.client.ping | If the **ipc.client.ping** parameter is configured to **true**, the HDFS client will wait for the response from the server and periodically send the **ping** message to avoid disconnection caused by **tcp timeout**. | true | + | | | | + | | If the **ipc.client.ping** parameter is configured to **false**, the HDFS client will set the value of **ipc.ping.interval** as the timeout time. If no response is received within that time, timeout occurs. | | + | | | | + | | To avoid the unresponsiveness of HDFS when the NameNode is overloaded for a long time, you are advised to set the parameter to **false**. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | ipc.ping.interval | If the value of **ipc.client.ping** is **true**, **ipc.ping.interval** indicates the interval between sending the ping messages. | 60000 | + | | | | + | | If the value of **ipc.client.ping** is **false**, **ipc.ping.interval** indicates the timeout time for connection. | | + | | | | + | | To avoid the unresponsiveness of HDFS when the NameNode is overloaded for a long time, you are advised to set the parameter to a large value, for example **900000** (unit ms) to avoid timeout when the server is busy. | | + +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/faq/this_page_cant_be_displayed_is_displayed_when_internet_explorer_fails_to_access_the_native_hdfs_ui.rst b/doc/component-operation-guide/source/using_hdfs/faq/this_page_cant_be_displayed_is_displayed_when_internet_explorer_fails_to_access_the_native_hdfs_ui.rst new file mode 100644 index 0000000..abeca19 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/this_page_cant_be_displayed_is_displayed_when_internet_explorer_fails_to_access_the_native_hdfs_ui.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1708.html + +.. 
_mrs_01_1708:
+
+"This page can't be displayed" Is Displayed When Internet Explorer Fails to Access the Native HDFS UI
+=====================================================================================================
+
+Question
+--------
+
+Occasionally, Internet Explorer 9, Explorer 10, or Explorer 11 fails to access the native HDFS UI.
+
+Symptom
+-------
+
+Internet Explorer 9, Explorer 10, or Explorer 11 fails to access the native HDFS UI, as shown in the following figure.
+
+|image1|
+
+Cause
+-----
+
+Some Internet Explorer 9, Explorer 10, or Explorer 11 versions fail to handle the SSL handshake, causing the access failure.
+
+Solution
+--------
+
+Refresh the page.
+
+.. |image1| image:: /_static/images/en-us_image_0000001295931412.jpg
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/when_does_a_balance_process_in_hdfs,_shut_down_and_fail_to_be_executed_again.rst b/doc/component-operation-guide/source/using_hdfs/faq/when_does_a_balance_process_in_hdfs,_shut_down_and_fail_to_be_executed_again.rst
new file mode 100644
index 0000000..4e96264
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/when_does_a_balance_process_in_hdfs,_shut_down_and_fail_to_be_executed_again.rst
@@ -0,0 +1,37 @@
+:original_name: mrs_01_1707.html
+
+.. _mrs_01_1707:
+
+When Does a Balance Process in HDFS, Shut Down and Fail to be Executed Again?
+=============================================================================
+
+Question
+--------
+
+After I start a Balance process in HDFS, the process is shut down abnormally. If I attempt to execute the Balance process again, it fails again.
+
+Answer
+------
+
+After a Balance process is executed in HDFS, another Balance process can be executed only after the **/system/balancer.id** file is automatically released.
+
+However, if a Balance process is shut down abnormally, the **/system/balancer.id** file has not been released when the Balance process is executed again, which triggers an **append /system/balancer.id** operation.
+
+- If releasing the **/system/balancer.id** file takes longer than the soft-limit lease period (60 seconds), executing the Balance process again triggers the append operation, which preempts the lease. Because the last block is in the under-construction or under-recovery state, a block recovery operation is triggered, and the **/system/balancer.id** file cannot be closed until the block recovery completes. Therefore, the append operation fails.
+
+  After the **append /system/balancer.id** operation fails, the exception message **RecoveryInProgressException** is displayed:
+
+  .. code-block::
+
+     org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.protocol.RecoveryInProgressException): Failed to APPEND_FILE /system/balancer.id for DFSClient because lease recovery is in progress. Try again later.
+
+- If releasing the **/system/balancer.id** file takes less than 60 seconds, the original client still owns the lease, an AlreadyBeingCreatedException occurs, and null is returned to the client. The following exception message is displayed on the client:
+
+  .. code-block::
+
+     java.io.IOException: Cannot create any NameNode Connectors.. Exiting...
+
+Either of the following methods can be used to solve the problem:
+
+- Execute the Balance process again after the hard-limit lease period (1 hour) expires, by which time the original client has released the lease.
+- Delete the **/system/balancer.id** file before executing the Balance process again, as shown in the example below.
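+
+The following is a minimal sketch of the second method, run on a node where the HDFS client has been installed and a user with permission on the **/system** directory has been authenticated (both are assumptions that depend on your environment):
+
+.. code-block::
+
+   # Check whether the lock file left by the abnormally terminated Balance process still exists.
+   hdfs dfs -ls /system/balancer.id
+   # Delete the lock file so that a new Balance process can recreate it.
+   hdfs dfs -rm /system/balancer.id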
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_are_there_two_standby_namenodes_after_the_active_namenode_is_restarted.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_are_there_two_standby_namenodes_after_the_active_namenode_is_restarted.rst new file mode 100644 index 0000000..f423521 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/why_are_there_two_standby_namenodes_after_the_active_namenode_is_restarted.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_1706.html + +.. _mrs_01_1706: + +Why are There Two Standby NameNodes After the active NameNode Is Restarted? +=========================================================================== + +Question +-------- + +Why are there two standby NameNodes after the active NameNode is restarted? + +When this problem occurs, check the ZooKeeper and ZooKeeper FC logs. You can find that the sessions used for the communication between the ZooKeeper server and client (ZKFC) are inconsistent. The session ID of the ZooKeeper server is **0x164cb2b3e4b36ae4**, and the session ID of the ZooKeeper FC is **0x144cb2b3e4b36ae4**. Such inconsistency means that the data interaction between the ZooKeeper server and ZKFC fails. + +Content of the ZooKeeper log is as follows: + +.. code-block:: + + 2015-04-15 21:24:54,257 | INFO | CommitProcessor:22 | Established session 0x164cb2b3e4b36ae4 with negotiated timeout 45000 for client /192.168.0.117:44586 | org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:623) + 2015-04-15 21:24:54,261 | INFO | NIOServerCxn.Factory:192-168-0-114/192.168.0.114:2181 | Successfully authenticated client: authenticationID=hdfs/hadoop@; authorizationID=hdfs/hadoop@. | org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:118) + 2015-04-15 21:24:54,261 | INFO | NIOServerCxn.Factory:192-168-0-114/192.168.0.114:2181 | Setting authorizedID: hdfs/hadoop@ | org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:134) + 2015-04-15 21:24:54,261 | INFO | NIOServerCxn.Factory:192-168-0-114/192.168.0.114:2181 | adding SASL authorization for authorizationID: hdfs/hadoop@ | org.apache.zookeeper.server.ZooKeeperServer.processSasl(ZooKeeperServer.java:1009) + 2015-04-15 21:24:54,262 | INFO | ProcessThread(sid:22 cport:-1): | Got user-level KeeperException when processing sessionid:0x164cb2b3e4b36ae4 type:create cxid:0x3 zxid:0x20009fafc txntype:-1 reqpath:n/a Error Path:/hadoop-ha/hacluster/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /hadoop-ha/hacluster/ActiveStandbyElectorLock | org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:648) + +Content of the ZKFC log is as follows: + +.. 
code-block:: + + 2015-04-15 21:24:54,237 | INFO | main-SendThread(192-168-0-114:2181) | Socket connection established to 192-168-0-114/192.168.0.114:2181, initiating session | org.apache.zookeeper.ClientCnxn$SendThread.primeConnection(ClientCnxn.java:854) + 2015-04-15 21:24:54,257 | INFO | main-SendThread(192-168-0-114:2181) | Session establishment complete on server 192-168-0-114/192.168.0.114:2181, sessionid = 0x144cb2b3e4b36ae4 , negotiated timeout = 45000 | org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1259) + 2015-04-15 21:24:54,260 | INFO | main-EventThread | EventThread shut down | org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:512) + 2015-04-15 21:24:54,262 | INFO | main-EventThread | Session connected. | org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:547) + 2015-04-15 21:24:54,264 | INFO | main-EventThread | Successfully authenticated to ZooKeeper using SASL. | org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:573) + +Answer +------ + +- Cause Analysis + + After the active NameNode restarts, the temporary node **/hadoop-ha/hacluster/ActiveStandbyElectorLock** created on ZooKeeper is deleted. After the standby NameNode receives that information that the **/hadoop-ha/hacluster/ActiveStandbyElectorLock** node is deleted, the standby NameNode creates the /**hadoop-ha/hacluster/ActiveStandbyElectorLock** node in ZooKeeper in order to switch to the active NameNode. However, when the standby NameNode connects with ZooKeeper through the client ZKFC, the session ID of ZKFC differs from that of ZooKeeper due to network issues, overload CPU, or overload clusters. In this case, the watcher of the standby NameNode fails to detect that the temporary node has been successfully created, and fails to consider the standby NameNode as the active NameNode. After the original active NameNode restarts, it detects that the **/hadoop-ha/hacluster/ActiveStandbyElectorLock** already exists and becomes the standby NameNode. Therefore, both NameNodes are standby NameNodes. + +- Solution + + You are advised to restart two ZKFCs of HDFS on FusionInsight Manager. diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_data_in_the_buffer_is_lost_if_a_power_outage_occurs_during_storage_of_small_files.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_data_in_the_buffer_is_lost_if_a_power_outage_occurs_during_storage_of_small_files.rst new file mode 100644 index 0000000..0019b08 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/why_data_in_the_buffer_is_lost_if_a_power_outage_occurs_during_storage_of_small_files.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1699.html + +.. _mrs_01_1699: + +Why Data in the Buffer Is Lost If a Power Outage Occurs During Storage of Small Files +===================================================================================== + +Question +-------- + +Why data in the buffer is lost if a power outage occurs during storage of small files? + +Answer +------ + +Because of a power outage, the blocks in the buffer are not written to the disk immediately after the write operation is completed. To enable synchronization of blocks to the disk, set **dfs.datanode.synconclose** to **true** in the **hdfs-site.xml** file. + +By default, **dfs.datanode.synconclose** is set to **false**. 
This improves performance but buffered data may be lost if a power outage occurs. Therefore, you are advised to set **dfs.datanode.synconclose** to **true** even though this may affect performance. You can decide whether to enable synchronization based on your actual requirements.
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_does_array_border-crossing_occur_during_fileinputformat_split.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_does_array_border-crossing_occur_during_fileinputformat_split.rst
new file mode 100644
index 0000000..f5ba885
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/why_does_array_border-crossing_occur_during_fileinputformat_split.rst
@@ -0,0 +1,31 @@
+:original_name: mrs_01_1700.html
+
+.. _mrs_01_1700:
+
+Why Does Array Border-crossing Occur During FileInputFormat Split?
+==================================================================
+
+Question
+--------
+
+When HDFS calls the FileInputFormat getSplit method, ArrayIndexOutOfBoundsException: 0 appears in the following log:
+
+.. code-block::
+
+   java.lang.ArrayIndexOutOfBoundsException: 0
+   at org.apache.hadoop.mapred.FileInputFormat.identifyHosts(FileInputFormat.java:708)
+   at org.apache.hadoop.mapred.FileInputFormat.getSplitHostsAndCachedHosts(FileInputFormat.java:675)
+   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:359)
+   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:210)
+   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
+   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
+   at scala.Option.getOrElse(Option.scala:120)
+   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
+   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
+
+Answer
+------
+
+For each block, the corresponding location array normally contains elements in the following form: /default/rack0/:,/default/rack0/datanodeip:port.
+
+The problem is caused by a damaged or lost block, in which case the IP address and port of the machine corresponding to the block are null. When this problem occurs, run **hdfs fsck** to check the health of the file blocks, remove the damaged blocks or restore the missing blocks, and then run the task again.
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_does_datanode_fail_to_start_when_the_number_of_disks_specified_by_dfs.datanode.data.dir_equals_dfs.datanode.failed.volumes.tolerated.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_does_datanode_fail_to_start_when_the_number_of_disks_specified_by_dfs.datanode.data.dir_equals_dfs.datanode.failed.volumes.tolerated.rst
new file mode 100644
index 0000000..4b1b9db
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/why_does_datanode_fail_to_start_when_the_number_of_disks_specified_by_dfs.datanode.data.dir_equals_dfs.datanode.failed.volumes.tolerated.rst
@@ -0,0 +1,20 @@
+:original_name: mrs_01_1696.html
+
+.. _mrs_01_1696:
+
+Why Does DataNode Fail to Start When the Number of Disks Specified by dfs.datanode.data.dir Equals dfs.datanode.failed.volumes.tolerated?
+=========================================================================================================================================
+
+Question
+--------
+
+If the number of disks specified by **dfs.datanode.data.dir** is equal to the value of **dfs.datanode.failed.volumes.tolerated**, the DataNode fails to start.
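+
+For illustration, a configuration such as the following triggers this behavior, because all three configured data directories are allowed to fail (the paths and values are hypothetical examples, not recommended settings):
+
+.. code-block::
+
+   <!-- hdfs-site.xml: three data directories, three tolerated volume failures -->
+   <property>
+   <name>dfs.datanode.data.dir</name>
+   <value>/data1/dn,/data2/dn,/data3/dn</value>
+   </property>
+   <property>
+   <name>dfs.datanode.failed.volumes.tolerated</name>
+   <value>3</value>
+   </property>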
+
+Answer
+------
+
+By default, the failure of a single disk causes the HDFS DataNode process to shut down, which results in the NameNode scheduling additional replicas for each block that is present on the DataNode. This causes needless replication of blocks that reside on disks that have not failed.
+
+To prevent this, you can configure DataNodes to tolerate the failure of dfs.data.dir directories by using the **dfs.datanode.failed.volumes.tolerated** parameter in **hdfs-site.xml**. For example, if the value of this parameter is 3, the DataNode shuts down only after four or more data directories have failed. This value is checked on DataNode startup.
+
+The number of tolerated failed volumes must always be less than the number of configured volumes. Alternatively, you can set the parameter to **-1**, which is equivalent to n-1 (where n is the number of disks); in that case, the DataNode is not shut down.
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_does_the_distcp_command_fail_in_the_secure_cluster,_causing_an_exception.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_does_the_distcp_command_fail_in_the_secure_cluster,_causing_an_exception.rst
new file mode 100644
index 0000000..fde4f50
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/why_does_the_distcp_command_fail_in_the_secure_cluster,_causing_an_exception.rst
@@ -0,0 +1,32 @@
+:original_name: mrs_01_1695.html
+
+.. _mrs_01_1695:
+
+Why Does the Distcp Command Fail in the Secure Cluster, Causing an Exception?
+=============================================================================
+
+Question
+--------
+
+Why does the **distcp** command fail in the secure cluster with the following errors displayed?
+
+Client side exception:
+
+.. code-block::
+
+   Invalid arguments: Unexpected end of file from server
+
+Server side exception:
+
+.. code-block::
+
+   javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
+
+Answer
+------
+
+The preceding error may occur if **webhdfs://** is used in the **distcp** command. The reason is that the big data cluster uses the HTTPS mechanism, that is, **dfs.http.policy** is set to **HTTPS_ONLY** in the **core-site.xml** file. To avoid the error, replace **webhdfs://** with **swebhdfs://** in the command.
+
+For example:
+
+**./hadoop distcp** **swebhdfs://IP:PORT/testfile hdfs://IP:PORT/testfile1**
diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_is_java.net.socketexception_no_buffer_space_available_reported_when_data_is_written_to_hdfs.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_is_java.net.socketexception_no_buffer_space_available_reported_when_data_is_written_to_hdfs.rst
new file mode 100644
index 0000000..2ec8bed
--- /dev/null
+++ b/doc/component-operation-guide/source/using_hdfs/faq/why_is_java.net.socketexception_no_buffer_space_available_reported_when_data_is_written_to_hdfs.rst
@@ -0,0 +1,75 @@
+:original_name: mrs_01_1705.html
+
+.. _mrs_01_1705:
+
+Why Is "java.net.SocketException: No buffer space available" Reported When Data Is Written to HDFS
+==================================================================================================
+
+Question
+--------
+
+Why is a "java.net.SocketException: No buffer space available" exception reported when data is written to HDFS?
+
+This problem occurs when files are written to HDFS. Check the error logs of the client and DataNode.
+
+The client logs are as follows:
+
+
+..
figure:: /_static/images/en-us_image_0000001295930408.jpg + :alt: **Figure 1** Client logs + + **Figure 1** Client logs + +DataNode logs are as follows: + +.. code-block:: + + 2017-07-24 20:43:39,269 | ERROR | DataXceiver for client DFSClient_NONMAPREDUCE_996005058_86 + at /192.168.164.155:40214 [Receiving block BP-1287143557-192.168.199.6-1500707719940:blk_1074269754_528941 with io weight 10] | DataNode{data=FSDataset{dirpath='[/srv/BigData/hadoop/data1/dn/current, /srv/BigData/hadoop/data2/dn/current, /srv/BigData/hadoop/data3/dn/current, /srv/BigData/hadoop/data4/dn/current, /srv/BigData/hadoop/data5/dn/current, /srv/BigData/hadoop/data6/dn/current, /srv/BigData/hadoop/data7/dn/current]'}, localName='192-168-164-155:9866', datanodeUuid='a013e29c-4e72-400c-bc7b-bbbf0799604c', xmitsInProgress=0}:Exception transfering block BP-1287143557-192.168.199.6-1500707719940:blk_1074269754_528941 to mirror 192.168.202.99:9866: java.net.SocketException: No buffer space available | DataXceiver.java:870 + 2017-07-24 20:43:39,269 | INFO | DataXceiver for client DFSClient_NONMAPREDUCE_996005058_86 + at /192.168.164.155:40214 [Receiving block BP-1287143557-192.168.199.6-1500707719940:blk_1074269754_528941 with io weight 10] | opWriteBlock BP-1287143557-192.168.199.6-1500707719940:blk_1074269754_528941 received exception java.net.SocketException: No buffer space available | DataXceiver.java:933 + 2017-07-24 20:43:39,270 | ERROR | DataXceiver for client DFSClient_NONMAPREDUCE_996005058_86 + at /192.168.164.155:40214 [Receiving block BP-1287143557-192.168.199.6-1500707719940:blk_1074269754_528941 with io weight 10] | 192-168-164-155:9866:DataXceiver error processing WRITE_BLOCK operation src: /192.168.164.155:40214 dst: /192.168.164.155:9866 | DataXceiver.java:304 java.net.SocketException: No buffer space available + at sun.nio.ch.Net.connect0(Native Method) + at sun.nio.ch.Net.connect(Net.java:454) + at sun.nio.ch.Net.connect(Net.java:446) + at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) + at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) + at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) + at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) + at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:800) + at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:138) + at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) + at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:265) + at java.lang.Thread.run(Thread.java:748) + +Answer +------ + +The preceding problem may be caused by network memory exhaustion. + +You can increase the threshold of the network device based on the actual scenario. + +Example: + +.. code-block:: console + + [root@xxxxx ~]# cat /proc/sys/net/ipv4/neigh/default/gc_thresh* + 128 + 512 + 1024 + [root@xxxxx ~]# echo 512 > /proc/sys/net/ipv4/neigh/default/gc_thresh1 + [root@xxxxx ~]# echo 2048 > /proc/sys/net/ipv4/neigh/default/gc_thresh2 + [root@xxxxx ~]# echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh3 + [root@xxxxx ~]# cat /proc/sys/net/ipv4/neigh/default/gc_thresh* + 512 + 2048 + 4096 + +You can also add the following parameters to the **/etc/sysctl.conf** file. The configuration takes effect even if the host is restarted. + +.. 
code-block:: + + net.ipv4.neigh.default.gc_thresh1 = 512 + net.ipv4.neigh.default.gc_thresh2 = 2048 + net.ipv4.neigh.default.gc_thresh3 = 4096 diff --git a/doc/component-operation-guide/source/using_hdfs/faq/why_is_the_storage_type_of_file_copies_disk_when_the_tiered_storage_policy_is_lazy_persist.rst b/doc/component-operation-guide/source/using_hdfs/faq/why_is_the_storage_type_of_file_copies_disk_when_the_tiered_storage_policy_is_lazy_persist.rst new file mode 100644 index 0000000..b4c7692 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/faq/why_is_the_storage_type_of_file_copies_disk_when_the_tiered_storage_policy_is_lazy_persist.rst @@ -0,0 +1,21 @@ +:original_name: mrs_01_1701.html + +.. _mrs_01_1701: + +Why Is the Storage Type of File Copies DISK When the Tiered Storage Policy Is LAZY_PERSIST? +=========================================================================================== + +Question +-------- + +When the storage policy of the file is set to **LAZY_PERSIST**, the storage type of the first replica should be **RAM_DISK**, and the storage type of other replicas should be **DISK**. + +But why is the storage type of all copies shown as **DISK** actually? + +Answer +------ + +When a user writes into a file whose storage policy is **LAZY_PERSIST**, three replicas are written one by one. The first replica is preferentially written into the DataNode where the client is located. The storage type of all replicas is **DISK** in the following scenarios: + +- If the DataNode where the client is located does not have the RAM disk, the first replica is written into the disk of the DataNode where the client is located, and other replicas are written into the disks of other nodes. +- If the DataNode where the client is located has the RAM disk, and the value of **dfs.datanode.max.locked.memory** is not specified or smaller than the value of **dfs.blocksize**, the first replica is written into the disk of the DataNode where the client is located, and other replicas are written into the disks of other nodes. diff --git a/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_read_performance_using_client_metadata_cache.rst b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_read_performance_using_client_metadata_cache.rst new file mode 100644 index 0000000..ee55344 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_read_performance_using_client_metadata_cache.rst @@ -0,0 +1,65 @@ +:original_name: mrs_01_1688.html + +.. _mrs_01_1688: + +Improving Read Performance Using Client Metadata Cache +====================================================== + +Scenario +-------- + +Improve the HDFS read performance by using the client to cache the metadata for block locations. + +.. note:: + + This function is recommended only for reading files that are not modified frequently. Because the data modification done on the server side by some other client is invisible to the cache client, which may cause the metadata obtained from the cache to be outdated. + + This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +**Navigation path for setting parameters:** + +On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations**, select **All Configurations**, and enter the parameter name in the search box. + +.. 
table:: **Table 1** Parameter configuration + + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +=======================================+=======================================================================================================================================================================================================================================================================+=======================+ + | dfs.client.metadata.cache.enabled | Enables or disables the client to cache the metadata for block locations. Set this parameter to **true** and use it along with the **dfs.client.metadata.cache.pattern** parameter to enable the cache. | false | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | dfs.client.metadata.cache.pattern | Indicates the regular expression pattern of the path of the file to be cached. Only the metadata for block locations of these files is cached until the metadata expires. This parameter is valid only when **dfs.client.metadata.cache.enabled** is set to **true**. | ``-`` | + | | | | + | | Example: **/test.\*** indicates that all files whose paths start with **/test** are read. | | + | | | | + | | .. note:: | | + | | | | + | | - To ensure consistency, configure a specific mode to cache only files that are not frequently modified by other clients. | | + | | | | + | | - The regular expression pattern verifies only the path of the URI, but not the schema and authority in the case of the Fully Qualified path. | | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | dfs.client.metadata.cache.expiry.sec | Indicates the duration for caching metadata. The cache entry becomes invalid after its caching time exceeds this duration. Even metadata that is frequently used during the caching process can become invalid. | 60s | + | | | | + | | Time suffixes **s**/**m**/**h** can be used to indicate second, minute, and hour, respectively. | | + | | | | + | | .. note:: | | + | | | | + | | If this parameter is set to **0s**, the cache function is disabled. | | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | dfs.client.metadata.cache.max.entries | Indicates the maximum number of non-expired data items that can be cached at a time. 
| 65536 | + +---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + +.. note:: + + Call *DFSClient#clearLocatedBlockCache()* to completely clear the client cache before it expires. + + The sample usage is as follows: + + .. code-block:: + + FileSystem fs = FileSystem.get(conf); + DistributedFileSystem dfs = (DistributedFileSystem) fs; + DFSClient dfsClient = dfs.getClient(); + dfsClient.clearLocatedBlockCache(); diff --git a/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_the_connection_between_the_client_and_namenode_using_current_active_cache.rst b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_the_connection_between_the_client_and_namenode_using_current_active_cache.rst new file mode 100644 index 0000000..c5ec375 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_the_connection_between_the_client_and_namenode_using_current_active_cache.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_1689.html + +.. _mrs_01_1689: + +Improving the Connection Between the Client and NameNode Using Current Active Cache +=================================================================================== + +Scenario +-------- + +When HDFS is deployed in high availability (HA) mode with multiple NameNode instances, the HDFS client needs to connect to each NameNode in sequence to determine which is the active NameNode and use it for client operations. + +Once the active NameNode is identified, its details can be cached and shared to all clients running on the client host. In this way, each new client first tries to load the details of the active Name Node from the cache and save the RPC call to the standby NameNode, which can help a lot in abnormal scenarios, for example, when the standby NameNode cannot be connected for a long time. + +When a fault occurs and the other NameNode is switched to the active state, the cached details are updated to the information about the current active NameNode. + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +Navigation path for setting parameters: + +On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations**, select **All Configurations**, and enter the parameter name in the search box. + +.. 
table:: **Table 1** Configuration parameters + + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +=====================================================+============================================================================================================================================================================================================================================================================================================================================================================================================================================================+=========================================================================+ + | dfs.client.failover.proxy.provider.[nameservice ID] | Client Failover proxy provider class which creates the NameNode proxy using the authenticated protocol. If this parameter is set to **org.apache.hadoop.hdfs.server.namenode.ha.BlackListingFailoverProxyProvider**, you can use the NameNode blacklist feature on the HDFS client. If this parameter is set to **org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider**, you can configure the observer NameNode to process read requests. | org.apache.hadoop.hdfs.server.namenode.ha.AdaptiveFailoverProxyProvider | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | dfs.client.failover.activeinfo.share.flag | Specifies whether to enable the cache function and share the detailed information about the current active NameNode with other clients. Set it to **true** to enable the cache function. | false | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | dfs.client.failover.activeinfo.share.path | Specifies the local directory for storing the shared files created by all clients in the host. If a cache area is to be shared by different users, the directory must have required permissions (for example, creating, reading, and writing cache files in the specified directory). 
| /tmp | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + | dfs.client.failover.activeinfo.share.io.timeout.sec | (Optional) Used to control timeout. The cache file is locked when it is being read or written, and if the file cannot be locked within the specified time, the attempt to read or update the caches will be abandoned. The unit is second. | 5 | + +-----------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+ + +.. note:: + + The cache files created by the HDFS client are reused by other clients, and thus these files will not be deleted from the local system. If this function is disabled, you may need to manually clear the data. diff --git a/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_write_performance.rst b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_write_performance.rst new file mode 100644 index 0000000..e81aa2f --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/improving_write_performance.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_1687.html + +.. _mrs_01_1687: + +Improving Write Performance +=========================== + +Scenario +-------- + +Improve the HDFS write performance by modifying the HDFS attributes. + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +Navigation path for setting parameters: + +On FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Configurations** and select **All Configurations**. Enter a parameter name in the search box. + +.. 
table:: **Table 1** Parameters for improving HDFS write performance + + +--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +======================================+=======================================================================================================================================================================================================================================================================================================================================+=======================+ + | dfs.datanode.drop.cache.behind.reads | Specifies whether to enable a DataNode to automatically clear all data in the cache after the data in the cache is transferred to the client. | false | + | | | | + | | - **true**: The cached data is discarded. This parameter needs to be configured on the DataNode. | | + | | | | + | | You are advised to set it to **true** if data is repeatedly read only a few times, so that the cache can be used by other operations. | | + | | | | + | | - **false**: You are advised to set it to **false** if data is read repeatedly for many times to improve the read speed. | | + | | | | + | | .. note:: | | + | | | | + | | This parameter is optional for improving write performance. You can configure it as needed. | | + +--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | dfs.client-write-packet-size | Specifies the size of the client write packet. When the HDFS client writes data to the DataNode, the data will be accumulated until a packet is generated. Then, the packet is transmitted over the network. This parameter specifies the size (unit: byte) of the data packet to be transmitted, which can be specified by each job. | 262144 | + | | | | + | | In the 10-Gigabit network, you can increase the value of this parameter to enhance the transmission throughput. | | + +--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/index.rst b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/index.rst new file mode 100644 index 0000000..f4d2a86 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/hdfs_performance_tuning/index.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_0829.html + +.. 
_mrs_01_0829: + +HDFS Performance Tuning +======================= + +- :ref:`Improving Write Performance ` +- :ref:`Improving Read Performance Using Client Metadata Cache ` +- :ref:`Improving the Connection Between the Client and NameNode Using Current Active Cache ` + +.. toctree:: + :maxdepth: 1 + :hidden: + + improving_write_performance + improving_read_performance_using_client_metadata_cache + improving_the_connection_between_the_client_and_namenode_using_current_active_cache diff --git a/doc/component-operation-guide/source/using_hdfs/index.rst b/doc/component-operation-guide/source/using_hdfs/index.rst new file mode 100644 index 0000000..3a4f82c --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/index.rst @@ -0,0 +1,72 @@ +:original_name: mrs_01_0790.html + +.. _mrs_01_0790: + +Using HDFS +========== + +- :ref:`Configuring Memory Management ` +- :ref:`Creating an HDFS Role ` +- :ref:`Using the HDFS Client ` +- :ref:`Running the DistCp Command ` +- :ref:`Overview of HDFS File System Directories ` +- :ref:`Changing the DataNode Storage Directory ` +- :ref:`Configuring HDFS Directory Permission ` +- :ref:`Configuring NFS ` +- :ref:`Planning HDFS Capacity ` +- :ref:`Configuring ulimit for HBase and HDFS ` +- :ref:`Balancing DataNode Capacity ` +- :ref:`Configuring Replica Replacement Policy for Heterogeneous Capacity Among DataNodes ` +- :ref:`Configuring the Number of Files in a Single HDFS Directory ` +- :ref:`Configuring the Recycle Bin Mechanism ` +- :ref:`Setting Permissions on Files and Directories ` +- :ref:`Setting the Maximum Lifetime and Renewal Interval of a Token ` +- :ref:`Configuring the Damaged Disk Volume ` +- :ref:`Configuring Encrypted Channels ` +- :ref:`Reducing the Probability of Abnormal Client Application Operation When the Network Is Not Stable ` +- :ref:`Configuring the NameNode Blacklist ` +- :ref:`Optimizing HDFS NameNode RPC QoS ` +- :ref:`Optimizing HDFS DataNode RPC QoS ` +- :ref:`Configuring Reserved Percentage of Disk Usage on DataNodes ` +- :ref:`Configuring HDFS NodeLabel ` +- :ref:`Using HDFS AZ Mover ` +- :ref:`Configuring the Observer NameNode to Process Read Requests ` +- :ref:`Performing Concurrent Operations on HDFS Files ` +- :ref:`Introduction to HDFS Logs ` +- :ref:`HDFS Performance Tuning ` +- :ref:`FAQ ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + configuring_memory_management + creating_an_hdfs_role + using_the_hdfs_client + running_the_distcp_command + overview_of_hdfs_file_system_directories + changing_the_datanode_storage_directory + configuring_hdfs_directory_permission + configuring_nfs + planning_hdfs_capacity + configuring_ulimit_for_hbase_and_hdfs + balancing_datanode_capacity + configuring_replica_replacement_policy_for_heterogeneous_capacity_among_datanodes + configuring_the_number_of_files_in_a_single_hdfs_directory + configuring_the_recycle_bin_mechanism + setting_permissions_on_files_and_directories + setting_the_maximum_lifetime_and_renewal_interval_of_a_token + configuring_the_damaged_disk_volume + configuring_encrypted_channels + reducing_the_probability_of_abnormal_client_application_operation_when_the_network_is_not_stable + configuring_the_namenode_blacklist + optimizing_hdfs_namenode_rpc_qos + optimizing_hdfs_datanode_rpc_qos + configuring_reserved_percentage_of_disk_usage_on_datanodes + configuring_hdfs_nodelabel + using_hdfs_az_mover + configuring_the_observer_namenode_to_process_read_requests + performing_concurrent_operations_on_hdfs_files + introduction_to_hdfs_logs + hdfs_performance_tuning/index + faq/index diff --git a/doc/component-operation-guide/source/using_hdfs/introduction_to_hdfs_logs.rst b/doc/component-operation-guide/source/using_hdfs/introduction_to_hdfs_logs.rst new file mode 100644 index 0000000..0de992f --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/introduction_to_hdfs_logs.rst @@ -0,0 +1,127 @@ +:original_name: mrs_01_0828.html + +.. _mrs_01_0828: + +Introduction to HDFS Logs +========================= + +Log Description +--------------- + +**Log path**: The default path of HDFS logs is **/var/log/Bigdata/hdfs/**\ *Role name*. + +- NameNode: **/var/log/Bigdata/hdfs/nn** (run logs) and **/var/log/Bigdata/audit/hdfs/nn** (audit logs) +- DataNode: **/var/log/Bigdata/hdfs/dn** (run logs) and **/var/log/Bigdata/audit/hdfs/dn** (audit logs) +- ZKFC: **/var/log/Bigdata/hdfs/zkfc** (run logs) and **/var/log/Bigdata/audit/hdfs/zkfc** (audit logs) +- JournalNode: **/var/log/Bigdata/hdfs/jn** (run logs) and **/var/log/Bigdata/audit/hdfs/jn** (audit logs) +- Router: **/var/log/Bigdata/hdfs/router** (run logs) and **/var/log/Bigdata/audit/hdfs/router** (audit logs) +- HttpFS: **/var/log/Bigdata/hdfs/httpfs** (run logs) and **/var/log/Bigdata/audit/hdfs/httpfs** (audit logs) + +**Log archive rule**: The automatic HDFS log compression function is enabled. By default, when the size of logs exceeds 100 MB, logs are automatically compressed into a log file named in the following format: *---.log | HDFS system log, which records most of the logs generated when the HDFS system is running. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hadoop---.out | Log that records the HDFS running environment information. 
| + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hadoop.log | Log that records the operation of the Hadoop client. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-period-check.log | Log that records scripts that are executed periodically, including automatic balancing, data migration, and JournalNode data synchronization detection. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ----gc.log | Garbage collection log file | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | postinstallDetail.log | Work log before the HDFS service startup and after the installation. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-service-check.log | Log that records whether the HDFS service starts successfully. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-set-storage-policy.log | Log that records the HDFS data storage policies. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | cleanupDetail.log | Log that records the cleanup logs about the uninstallation of the HDFS service. 
| + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | prestartDetail.log | Log that records cluster operations before the HDFS service startup. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-recover-fsimage.log | Recovery log of the NameNode metadata. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | datanode-disk-check.log | Log that records the disk status check during the cluster installation and use. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-availability-check.log | Log that check whether the HDFS service is available. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-backup-fsimage.log | Backup log of the NameNode metadata. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | startDetail.log | Detailed log that records the HDFS service startup. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-blockplacement.log | Log that records the placement policy of HDFS blocks. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | upgradeDetail.log | Upgrade logs. 
| + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-clean-acls-java.log | Log that records the clearing of deleted roles' ACL information by HDFS. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-haCheck.log | Run log that checks whether the NameNode in active or standby state has obtained scripts. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | -jvmpause.log | Log that records JVM pauses during process running. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hadoop--balancer-.log | Run log of HDFS automatic balancing. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hadoop--balancer-.out | Log that records information of the environment where HDFS executes automatic balancing. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-switch-namenode.log | Run log that records the HDFS active/standby switchover. 
| + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | hdfs-router-admin.log | Run log of the mount table management operation | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Tomcat logs | hadoop-omm-host1.out, httpfs-catalina..log, httpfs-host-manager..log, httpfs-localhost..log, httpfs-manager..log, localhost_access_web_log.log | Tomcat run log | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | hdfs-audit-.log | Audit log that records the HDFS operations (such as creating, deleting, modifying and querying files). | + | | | | + | | ranger-plugin-audit.log | | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | SecurityAuth.audit | HDFS security audit log. | + +-----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Log Level +--------- + +:ref:`Table 2 ` lists the log levels supported by HDFS. The log levels include FATAL, ERROR, WARN, INFO, and DEBUG. Logs of which the levels are higher than or equal to the set level will be printed by programs. The higher the log level is set, the fewer the logs are recorded. + +.. _mrs_01_0828__t9a69df8da9a84f41bb6fd3e008d7a3b8: + +.. table:: **Table 2** Log levels + + ===== ============================================================== + Level Description + ===== ============================================================== + FATAL Indicates the critical error information about system running. + ERROR Indicates the error information about system running. + WARN Indicates that the current event processing exists exceptions. + INFO Indicates that the system and events are running properly. + DEBUG Indicates the system and system debugging information. + ===== ============================================================== + +To modify log levels, perform the following operations: + +#. Go to the **All Configurations** page of HDFS by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the left menu bar, select the log menu of the target role. +#. Select a desired log level. +#. Save the configuration. 
In the displayed dialog box, click **OK** to make the configurations take effect. + + .. note:: + + The configurations take effect immediately without restarting the service. + +Log Formats +----------- + +The following table lists the HDFS log formats. + +.. table:: **Table 3** Log formats + + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Type | Format | Example | + +===========+========================================================================================================================================================+========================================================================================================================================================================================================================================================================================================================================================================================+ + | Run log | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2015-01-26 18:43:42,840 \| INFO \| IPC Server handler 40 on 8020 \| Rolling edit logs \| org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1096) | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | <*yyyy-MM-dd HH:mm:ss,SSS*>|<*Log level*>|<*Name of the thread that generates the log*>|<*Message in the log*>|<*Location where the log event occurs*> | 2015-01-26 18:44:42,607 \| INFO \| IPC Server handler 32 on 8020 \| allowed=true ugi=hbase (auth:SIMPLE) ip=/10.177.112.145 cmd=getfileinfo src=/hbase/WALs/hghoulaslx410,16020,1421743096083/hghoulaslx410%2C16020%2C1421743096083.1422268722795 dst=null perm=null \| org.apache.hadoop.hdfs.server.namenode.FSNamesystem$DefaultAuditLogger.logAuditMessage(FSNamesystem.java:7950) | + +-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_datanode_rpc_qos.rst 
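The run log and audit log formats in Table 3 share the same pipe-separated layout, so a single parser can handle both. The following is a minimal, illustrative sketch (the class name is hypothetical and not part of HDFS or MRS; it assumes a well-formed line with exactly four ``|`` separators) that splits the sample run-log line from Table 3 into its five documented fields:

.. code-block:: java

   public class HdfsLogLineParser {
       public static void main(String[] args) {
           // Sample run-log line copied from Table 3.
           String line = "2015-01-26 18:43:42,840 | INFO | IPC Server handler 40 on 8020"
                   + " | Rolling edit logs"
                   + " | org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1096)";
           // A well-formed line contains exactly four "|" separators, giving five fields:
           // timestamp, log level, thread name, message, and code location.
           String[] fields = line.split("\\s*\\|\\s*", 5);
           if (fields.length == 5) {
               System.out.println("Timestamp : " + fields[0]);
               System.out.println("Level     : " + fields[1]);
               System.out.println("Thread    : " + fields[2]);
               System.out.println("Message   : " + fields[3]);
               System.out.println("Location  : " + fields[4]);
           }
       }
   }

The same split also applies to audit-log lines, whose message field carries the ``allowed=... ugi=... cmd=...`` key-value pairs shown in the audit-log example.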
b/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_datanode_rpc_qos.rst new file mode 100644 index 0000000..999e8bd --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_datanode_rpc_qos.rst @@ -0,0 +1,30 @@ +:original_name: mrs_01_1673.html + +.. _mrs_01_1673: + +Optimizing HDFS DataNode RPC QoS +================================ + +Scenario +-------- + +When the speed at which the client writes data to the HDFS is greater than the disk bandwidth of the DataNode, the disk bandwidth is fully occupied. As a result, the DataNode does not respond. The client can back off only by canceling or restoring the channel, which results in write failures and unnecessary channel recovery operations. + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Configuration +------------- + +The new configuration parameter **dfs.pipeline.ecn** is introduced. When this configuration is enabled, the DataNode sends a signal from the write channel when the write channel is overloaded. The client may perform backoff based on the blocking signal to prevent the system from being overloaded. This configuration parameter is introduced to make the channel more stable and reduce unnecessary cancellation or recovery operations. After receiving the signal, the client backs off for a period of time (5,000 ms), and then adjusts the backoff time based on the related filter (the maximum backoff time is 50,000 ms). + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** DN ECN configuration + + +------------------+----------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +==================+==================================================================================+===============+ + | dfs.pipeline.ecn | After configuration, the DataNode can send blocking notifications to the client. | false | + +------------------+----------------------------------------------------------------------------------+---------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_namenode_rpc_qos.rst b/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_namenode_rpc_qos.rst new file mode 100644 index 0000000..9b1260b --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/optimizing_hdfs_namenode_rpc_qos.rst @@ -0,0 +1,78 @@ +:original_name: mrs_01_1672.html + +.. _mrs_01_1672: + +Optimizing HDFS NameNode RPC QoS +================================ + +Scenarios +--------- + +.. note:: + + This section applies to MRS 3.\ *x* or later. + +Several finished Hadoop clusters are faulty because the NameNode is overloaded and unresponsive. + +Such problem is caused by the initial design of Hadoop: In Hadoop, the NameNode functions as an independent part and in its namespace coordinates various HDFS operations, including obtaining the data block location, listing directories, and creating files. The NameNode receives HDFS operations, regards them as RPC calls, and places them in the FIFO call queue for read threads to process. Requests in FIFO call queue are served first-in first-out. However, users who perform more I/O operations are served more time than those performing fewer I/O operations. In this case, the FIFO is unfair and causes the delay. + + +.. 
figure:: /_static/images/en-us_image_0000001296250312.png + :alt: **Figure 1** NameNode request processing based on the FIFO call queue + + **Figure 1** NameNode request processing based on the FIFO call queue + +The unfair problem and delaying mentioned before can be improved by replacing the FIFO queue with a new type of queue called FairCallQueue. In this way, FAIR queues assign incoming RPC calls to multiple queues based on the scale of the caller's call. The scheduling module tracks the latest calls and assigns a higher priority to users with a smaller number of calls. + + +.. figure:: /_static/images/en-us_image_0000001349289997.png + :alt: **Figure 2** NameNode request processing based on FAIRCallQueue + + **Figure 2** NameNode request processing based on FAIRCallQueue + +Configuration Description +------------------------- + +- FairCallQueue ensures quality of service (QoS) by internally adjusting the order in which RPCs are invoked. + + This queue consists of the following parts: + + #. DecayRpcScheduler: used to provide priority values from 0 to N (the value 0 indicates the highest priority). + #. Multi-level queues (located in the FairCallQueue): used to ensure that queues are invoked in order of priority. + #. Multi-channel converters (provided with Weighted Round Robin Multiplexer): used to provide logic control for queue selection. + + After the FairCallQueue is configured, the control module determines the sub-queue to which the received invoking is allocated. The current scheduling module is DecayRpcScheduler, which only continuously tracks the priority numbers of various calls and periodically reduces these numbers. + + Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + + .. table:: **Table 1** FairCallQueue parameters + + +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------+ + | Parameter | Description | Default Value | + +===============================+==========================================================================================================================================+==========================================+ + | ipc.\ **.callqueue.impl | Specifies the queue implementation class. You need to run the **org.apache.hadoop.ipc.FairCallQueue** command to enable the QoS feature. | java.util.concurrent.LinkedBlockingQueue | + +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------+ + +- RPC BackOff + + Backoff is one of the FairCallQueue functions. It requires the client to retry operations (such as creating, deleting, and opening a file) after a period of time. When the backoff occurs, the RCP server throws RetriableException. The FairCallQueue performs backoff in either of the following cases: + + - The queue is full, that is, there are many client calls in the queue. + - The queue response time is longer than the threshold time (specified by the **ipc..decay-scheduler.backoff.responsetime.thresholds** parameter). + + .. 
table:: **Table 2** RPC Backoff configuration + + +----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ + | Parameter | Description | Default Value | + +================================================================+==============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+=========================+ + | ipc.\ **.backoff.enable | Specifies whether to enable the backoff. When the current application contains a large number of user callings, the RPC request is blocked if the connection limit of the operating system is not reached. Alternatively, when the RPC or NameNode is heavily loaded, some explicit exceptions can be thrown back to the client based on certain policies. The client can understand these exceptions and perform exponential rollback, which is another implementation of the RetryInvocationHandler class. | false | + +----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ + | ipc.\ **.decay-scheduler.backoff.responsetime.enable | Indicate whether to enable the backoff based on the average queue response time. | false | + +----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ + | ipc.\ **.decay-scheduler.backoff.responsetime.thresholds | Configure the response time threshold for each queue. The response time threshold must match the number of priorities (the value of **ipc. .faircallqueue.priority-levels**). 
Unit: millisecond | 10000,20000,30000,40000 | + +----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ + +.. note:: + + - **** indicates the RPC port configured on the NameNode. + - The backoff function based on the response time takes effect only when **ipc. .backoff.enable** is set to **true**. diff --git a/doc/component-operation-guide/source/using_hdfs/overview_of_hdfs_file_system_directories.rst b/doc/component-operation-guide/source/using_hdfs/overview_of_hdfs_file_system_directories.rst new file mode 100644 index 0000000..ab57e4d --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/overview_of_hdfs_file_system_directories.rst @@ -0,0 +1,148 @@ +:original_name: mrs_01_0795.html + +.. _mrs_01_0795: + +Overview of HDFS File System Directories +======================================== + +This section describes the directory structure in HDFS, as shown in the following table. + +.. table:: **Table 1** HDFS directory structure (applicable to versions earlier than MRS 3.x) + + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | Path | Type | Function | Whether the Directory Can Be Deleted | Deletion Consequence | + +====================================================+=================================+==========================================================================================================================================================================================================================================================================================================================================================================================+======================================+===============================================================================+ + | /tmp/spark/sparkhive-scratch | Fixed directory | Stores temporary files of metastore sessions in Spark JDBCServer. | No | Failed to run the task. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/sparkhive-scratch | Fixed directory | Stores temporary files of metastore session that are executed using Spark CLI. | No | Failed to run the task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/carbon/ | Fixed directory | Stores the abnormal data in this directory if abnormal CarbonData data exists during data import. | Yes | Error data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/Loader-${*Job name*}_${*MR job ID*} | Temporary directory | Stores the region information about Loader HBase bulkload jobs. The data is automatically deleted after the job running is completed. | No | Failed to run the Loader HBase Bulkload job. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/logs | Fixed directory | Stores the collected MR task logs. | Yes | MR task logs are lost. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/archived | Fixed directory | Archives the MR task logs on HDFS. | Yes | MR task logs are lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging | Fixed directory | Stores the run logs, summary information, and configuration attributes of ApplicationMaster running jobs. | No | Services are running improperly. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging/history/done_intermediate | Fixed directory | Stores temporary files in the **/tmp/hadoop-yarn/staging** directory after all tasks are executed. | No | MR task logs are lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging/history/done | Fixed directory | The periodic scanning thread periodically moves the **done_intermediate** log file to the **done** directory. | No | MR task logs are lost. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/mr-history | Fixed directory | Stores the historical record files that are pre-loaded. | No | Historical MR task log data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hive | Fixed directory | Stores Hive temporary files. | No | Failed to run the Hive task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hive-scratch | Fixed directory | Stores temporary data (such as session information) generated during Hive running. | No | Failed to run the current task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/{user}/.sparkStaging | Fixed directory | Stores temporary files of the SparkJDBCServer application. | No | Failed to start the executor. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/spark/jars | Fixed directory | Stores running dependency packages of the Spark executor. | No | Failed to start the executor. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader | Fixed directory | Stores dirty data of Loader jobs and data of HBase jobs. | No | Failed to execute the HBase job. Or dirty data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_dirty_data_dir | | | | | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_hbase_putlist_tmp | | | | | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_hbase_tmp | | | | | + 
+----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/mapred | Fixed directory | Stores Hadoop-related files. | No | Failed to start Yarn. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/hive | Fixed directory | Stores Hive-related data by default, including the depended Spark lib package and default table data storage path. | No | User data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/omm-bulkload | Temporary directory | Stores HBase batch import tools temporarily. | No | Failed to import HBase tasks in batches. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/hbase | Temporary directory | Stores HBase batch import tools temporarily. | No | Failed to import HBase tasks in batches. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /sparkJobHistory | Fixed directory | Stores Spark event log data. | No | The History Server service is unavailable, and the task fails to be executed. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /flume | Fixed directory | Stores data collected by Flume from HDFS. | No | Flume runs improperly. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /mr-history/tmp | Fixed directory | Stores logs generated by MapReduce jobs. | Yes | Log information is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /mr-history/done | Fixed directory | Stores logs managed by MR JobHistory Server. | Yes | Log information is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tenant | Created when a tenant is added. | Directory of a tenant in the HDFS. 
By default, the system automatically creates a folder in the **/tenant** directory based on the tenant name. For example, the default HDFS storage directory for **ta1** is **tenant/ta1**. When a tenant is created for the first time, the system creates the **/tenant** directory in the HDFS root directory. You can customize the storage path. | No | The tenant account is unavailable. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /apps{1~5}/ | Fixed directory | Stores the Hive package used by WebHCat. | No | Failed to run the WebHCat tasks. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /hbase | Fixed directory | Stores HBase data. | No | HBase user data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /hbaseFileStream | Fixed directory | Stores HFS files. | No | The HFS file is lost and cannot be restored. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /ats/active | Fixed directory | HDFS path used to store the timeline data of running applications. | No | Failed to run the **tez** task after the directory deletion. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /ats/done | Fixed directory | HDFS path used to store the timeline data of completed applications. | No | Automatically created after the deletion. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /flink | Fixed directory | Stores the checkpoint task data. | No | Failed to run tasks after the deletion. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + +.. table:: **Table 2** Directory structure of the HDFS file system (applicable to MRS 3.x or later) + + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | Path | Type | Function | Whether the Directory Can Be Deleted | Deletion Consequence | + +====================================================+=================================+==========================================================================================================================================================================================================================================================================================================================================================================================+======================================+===============================================================================+ + | /tmp/spark2x/sparkhive-scratch | Fixed directory | Stores temporary files of metastore session in Spark2x JDBCServer. 
| No | Failed to run the task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/sparkhive-scratch | Fixed directory | Stores temporary files of metastore sessions that are executed in CLI mode using Spark2x CLI. | No | Failed to run the task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/logs/ | Fixed directory | Stores container log files. | Yes | Container log files cannot be viewed. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/carbon/ | Fixed directory | Stores the abnormal data in this directory if abnormal CarbonData data exists during data import. | Yes | Error data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/Loader-${*Job name*}_${*MR job ID*} | Temporary directory | Stores the region information about Loader HBase bulkload jobs. The data is automatically deleted after the job running is completed. | No | Failed to run the Loader HBase Bulkload job. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-omm/yarn/system/rmstore | Fixed directory | Stores the ResourceManager running information. | Yes | Status information is lost after ResourceManager is restarted. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/archived | Fixed directory | Archives the MR task logs on HDFS. | Yes | MR task logs are lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging | Fixed directory | Stores the run logs, summary information, and configuration attributes of ApplicationMaster running jobs. | No | Services are running improperly. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging/history/done_intermediate | Fixed directory | Stores temporary files in the **/tmp/hadoop-yarn/staging** directory after all tasks are executed. | No | MR task logs are lost. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hadoop-yarn/staging/history/done | Fixed directory | The periodic scanning thread periodically moves the **done_intermediate** log file to the **done** directory. | No | MR task logs are lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/mr-history | Fixed directory | Stores the historical record files that are pre-loaded. | No | Historical MR task log data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tmp/hive-scratch | Fixed directory | Stores temporary data (such as session information) generated during Hive running. | No | Failed to run the current task. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/{user}/.sparkStaging | Fixed directory | Stores temporary files of the SparkJDBCServer application. | No | Failed to start the executor. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/spark2x/jars | Fixed directory | Stores running dependency packages of the Spark2x executor. | No | Failed to start the executor. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader | Fixed directory | Stores dirty data of Loader jobs and data of HBase jobs. | No | Failed to execute the HBase job. Or dirty data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_dirty_data_dir | | | | | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_hbase_putlist_tmp | | | | | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/loader/etl_hbase_tmp | | | | | + 
+----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/oozie | Fixed directory | Stores dependent libraries required for Oozie running, which needs to be manually uploaded. | No | Failed to schedule Oozie. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/mapred/hadoop-mapreduce-*3.1.1*.tar.gz | Fixed files | Stores JAR files used by the distributed MR cache. | No | The MR distributed cache function is unavailable. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/hive | Fixed directory | Stores Hive-related data by default, including the depended Spark lib package and default table data storage path. | No | User data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/omm-bulkload | Temporary directory | Stores HBase batch import tools temporarily. | No | Failed to import HBase tasks in batches. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /user/hbase | Temporary directory | Stores HBase batch import tools temporarily. | No | Failed to import HBase tasks in batches. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /spark2xJobHistory2x | Fixed directory | Stores Spark2x eventlog data. | No | The History Server service is unavailable, and the task fails to be executed. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /flume | Fixed directory | Stores data collected by Flume from HDFS. | No | Flume runs improperly. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /mr-history/tmp | Fixed directory | Stores logs generated by MapReduce jobs. | Yes | Log information is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /mr-history/done | Fixed directory | Stores logs managed by MR JobHistory Server. 
| Yes | Log information is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /tenant | Created when a tenant is added. | Directory of a tenant in the HDFS. By default, the system automatically creates a folder in the **/tenant** directory based on the tenant name. For example, the default HDFS storage directory for **ta1** is **tenant/ta1**. When a tenant is created for the first time, the system creates the **/tenant** directory in the HDFS root directory. You can customize the storage path. | No | The tenant account is unavailable. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /apps{1~5}/ | Fixed directory | Stores the Hive package used by WebHCat. | No | Failed to run the WebHCat tasks. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /hbase | Fixed directory | Stores HBase data. | No | HBase user data is lost. | + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ + | /hbaseFileStream | Fixed directory | Stores HFS files. | No | The HFS file is lost and cannot be restored. 
| + +----------------------------------------------------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/performing_concurrent_operations_on_hdfs_files.rst b/doc/component-operation-guide/source/using_hdfs/performing_concurrent_operations_on_hdfs_files.rst new file mode 100644 index 0000000..2e83db3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/performing_concurrent_operations_on_hdfs_files.rst @@ -0,0 +1,105 @@ +:original_name: mrs_01_1684.html + +.. _mrs_01_1684: + +Performing Concurrent Operations on HDFS Files +============================================== + +Scenario +-------- + +This operation allows you to concurrently modify the permissions and access control lists (ACLs) of files and directories in a cluster. + +.. note:: + + This section applies to MRS 3.\ *x* or later clusters. + +Impact on the System +-------------------- + +Performing concurrent file modification operations in a cluster adversely affects cluster performance. Therefore, you are advised to do so when the cluster is idle. + +Prerequisites +------------- + +- The HDFS client, or a client that contains HDFS, has been installed. For example, the installation directory is **/opt/client**. +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user needs to change the password upon the first login. (This operation is not required in normal mode.) + +Procedure +--------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, the user who runs the following commands must belong to the **supergroup** group and must run the following command to perform user authentication. In normal mode, user authentication is not required. + + **kinit** *Component service user* + +#. Increase the JVM memory size of the client to prevent out-of-memory (OOM) errors. (32 GB is recommended for 100 million files.) + + .. note:: + + If the HDFS client exits abnormally and the error message "java.lang.OutOfMemoryError" is displayed after an HDFS client command is executed, the memory required for running the HDFS client exceeds the preset upper limit (128 MB by default). + + You can change the memory upper limit of the client by modifying **CLIENT_GC_OPTS** in <*Client installation path*>\ **/HDFS/component_env**. For example, to set the upper limit to 1 GB, run the following command: + + CLIENT_GC_OPTS="-Xmx1G" + + After the modification, run the following command to make the modification take effect: + + **source** <*Client installation path*>\ **/bigdata_env** + +#. Run the concurrent commands shown in the following table.
+ + +----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ + | Command | Description | Function | + +================================================================================================================+===========================================================================================================================+============================================================================+ + | hdfs quickcmds [-t threadsNumber] [-p principal] [-k keytab] -setrep ... | **threadsNumber** indicates the number of concurrent threads. The default value is the number of vCPUs of the local host. | Used to concurrently set the number of copies of all files in a directory. | + | | | | + | | **principal** indicates the Kerberos user. | | + | | | | + | | **keytab** indicates the Keytab file. | | + | | | | + | | **rep** indicates the number of replicas. | | + | | | | + | | **path** indicates the HDFS directory. | | + +----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ + | hdfs quickcmds [-t threadsNumber] [-p principal] [-k keytab] -chown [owner][:[group]] ... | **threadsNumber** indicates the number of concurrent threads. The default value is the number of vCPUs of the local host. | Used to concurrently set the owner group of all files in the directory. | + | | | | + | | **principal** indicates the Kerberos user. | | + | | | | + | | **keytab** indicates the Keytab file. | | + | | | | + | | **owner** indicates the owner. | | + | | | | + | | **group** indicates the group to which the user belongs. | | + | | | | + | | **path** indicates the HDFS directory. | | + +----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ + | hdfs quickcmds [-t threadsNumber] [-p principal] [-k keytab] -chmod ... | **threadsNumber** indicates the number of concurrent threads. The default value is the number of vCPUs of the local host. | Used to concurrently set permissions for all files in a directory. | + | | | | + | | **principal** indicates the Kerberos user. | | + | | | | + | | **keytab** indicates the Keytab file. | | + | | | | + | | **mode** indicates the permission (for example, 754). | | + | | | | + | | **path** indicates the HDFS directory. | | + +----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ + | hdfs quickcmds [-t threadsNumber] [-p principal] [-k keytab] -setfacl [{-b|-k} {-m|-x } ``...]|[--``\ set ...] | **threadsNumber** indicates the number of concurrent threads. The default value is the number of vCPUs of the local host. 
| Used to concurrently set ACL information for all files in a directory. | + | | | | + | | **principal** indicates the Kerberos user. | | + | | | | + | | **keytab** indicates the Keytab file. | | + | | | | + | | **acl_spec** indicates the ACL list separated by commas (,). | | + | | | | + | | **path** indicates the HDFS directory. | | + +----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/planning_hdfs_capacity.rst b/doc/component-operation-guide/source/using_hdfs/planning_hdfs_capacity.rst new file mode 100644 index 0000000..15ed2fe --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/planning_hdfs_capacity.rst @@ -0,0 +1,135 @@ +:original_name: mrs_01_0799.html + +.. _mrs_01_0799: + +Planning HDFS Capacity +====================== + +In HDFS, the DataNode stores user files and directories as blocks, and file objects are generated on the NameNode to map each file, directory, and block on the DataNode. + +File objects on the NameNode consume memory, and the memory consumption increases linearly as more file objects are generated. As the number of files and directories stored on the DataNodes grows, the number of file objects on the NameNode grows accordingly and consumes more memory. In this case, the existing hardware may no longer meet service requirements, and the cluster becomes difficult to scale out. + +Capacity planning for an HDFS cluster that stores a large number of files means planning the capacity specifications of the NameNode and DataNodes and setting parameters according to those plans. + +Capacity Specifications +----------------------- + +- NameNode capacity specifications + + Each file object on the NameNode corresponds to a file, directory, or block on the DataNode. + + A file uses at least one block. The default size of a block is **134,217,728**, that is, 128 MB, which can be set using the **dfs.blocksize** parameter. By default, a file whose size is less than 128 MB occupies only one block. If the file size is greater than 128 MB, the number of occupied blocks is the file size divided by 128 MB and rounded up (Number of occupied blocks = File size/128 MB). Directories do not occupy any blocks. + + Based on **dfs.blocksize**, the number of file objects on the NameNode is calculated as follows: + + .. table:: **Table 1** Number of NameNode file objects + + +--------------------------------+---------------------------------------------------------+ + | Size of a File | Number of File Objects | + +================================+=========================================================+ + | < 128 MB | 1 (File) + 1 (Block) = 2 | + +--------------------------------+---------------------------------------------------------+ + | > 128 MB (for example, 128 GB) | 1 (File) + 1,024 (128 GB/128 MB = 1,024 blocks) = 1,025 | + +--------------------------------+---------------------------------------------------------+ + + The maximum number of file objects supported by the active and standby NameNodes is 300,000,000 (equivalent to 150,000,000 small files). **dfs.namenode.max.objects** specifies the number of file objects that can be generated in the system. The default value is **0**, which indicates that the number of generated file objects is not limited.
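+
+   For example, the 128 GB row in Table 1 can be reproduced with a short shell calculation (an illustrative sketch only, assuming the default **dfs.blocksize** of 128 MB):
+
+   .. code-block:: bash
+
+      # Rough file-object estimate for a single 128 GB file with the
+      # default 128 MB block size (dfs.blocksize = 134217728).
+      FILE_SIZE_MB=131072
+      BLOCK_SIZE_MB=128
+      BLOCKS=$(( (FILE_SIZE_MB + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))   # 1024 blocks
+      echo $(( 1 + BLOCKS ))   # 1 (file) + 1024 (blocks) = 1025 file objects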
+ +- DataNode capacity specifications + + In HDFS, blocks are stored on the DataNode as copies. The default number of copies is **3**, which can be set using the **dfs.replication** parameter. + + The number of blocks stored on all DataNode role instances in the cluster can be calculated based on the following formula: Number of blocks stored on all DataNodes = Number of HDFS blocks x 3. The average number of blocks stored on each DataNode is calculated as follows: Average number of blocks on a DataNode = Number of HDFS blocks x 3/Number of DataNodes. + + .. table:: **Table 2** DataNode specifications + + +-----------------------------------------------------------------------------------------------------------------+----------------+ + | Item | Specifications | + +=================================================================================================================+================+ + | Maximum number of blocks supported by a DataNode instance | 5,000,000 | + +-----------------------------------------------------------------------------------------------------------------+----------------+ + | Maximum number of blocks supported by a disk on a DataNode instance | 500,000 | + +-----------------------------------------------------------------------------------------------------------------+----------------+ + | Minimum number of disks required when the number of blocks supported by a DataNode instance reaches the maximum | 10 | + +-----------------------------------------------------------------------------------------------------------------+----------------+ + + .. table:: **Table 3** Number of DataNodes + + ===================== ================================ + Number of HDFS Blocks Minimum Number of DataNode Roles + ===================== ================================ + 10,000,000 10,000,000 \*3/5,000,000 = 6 + 50,000,000 50,000,000 \*3/5,000,000 = 30 + 100,000,000 100,000,000 \*3/5,000,000 = 60 + ===================== ================================ + +Setting Memory Parameters +------------------------- + +- Configuration rules of the NameNode JVM parameter + + Default value of the NameNode JVM parameter **GC_OPTS**: + + -Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=3072 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Djava.io.tmpdir=${Bigdata_tmp_dir} + + The memory used by the NameNode is proportional to the number of file objects. When the number of file objects changes, you need to change **-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M** in the default value. The following table lists the reference values. + + .. 
table:: **Table 4** NameNode JVM configuration + + +------------------------+------------------------------------------------------+ + | Number of File Objects | Reference Value | + +========================+======================================================+ + | 10,000,000 | -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M | + +------------------------+------------------------------------------------------+ + | 20,000,000 | -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G | + +------------------------+------------------------------------------------------+ + | 50,000,000 | -Xms32G -Xmx32G -XX:NewSize=3G -XX:MaxNewSize=3G | + +------------------------+------------------------------------------------------+ + | 100,000,000 | -Xms64G -Xmx64G -XX:NewSize=6G -XX:MaxNewSize=6G | + +------------------------+------------------------------------------------------+ + | 200,000,000 | -Xms96G -Xmx96G -XX:NewSize=9G -XX:MaxNewSize=9G | + +------------------------+------------------------------------------------------+ + | 300,000,000 | -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G | + +------------------------+------------------------------------------------------+ + +- Configuration rules of the DataNode JVM parameter + + Default value of the DataNode JVM parameter **GC_OPTS**: + + -Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=3072 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Djava.io.tmpdir=${Bigdata_tmp_dir} + + The average number of blocks stored in each DataNode instance in the cluster is: Number of HDFS blocks x 3/Number of DataNodes. If the average number of blocks changes, you need to change **-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M** in the default value. The following table lists the reference values. + + .. table:: **Table 5** DataNode JVM configuration + + +-------------------------------------------------+----------------------------------------------------+ + | Average Number of Blocks in a DataNode Instance | Reference Value | + +=================================================+====================================================+ + | 2,000,000 | -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M | + +-------------------------------------------------+----------------------------------------------------+ + | 5,000,000 | -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G | + +-------------------------------------------------+----------------------------------------------------+ + + **Xmx** specifies memory which corresponds to the threshold of the number of DataNode blocks, and each GB memory supports a maximum of 500,000 DataNode blocks. Set the memory as required. + +Viewing the HDFS Capacity Status +-------------------------------- + +- NameNode information + + For MRS 1.9.2 or earlier: Log in to MRS Manager and choose **Services** > **HDFS** > **NameNode (Active)**. Click **Overview** and check the number of file objects, files, directories, or blocks in the HDFS in **Summary**. + + For versions earlier than MRS 3.\ *x*: Log in to the MRS console, and choose **Components** > **HDFS** > **NameNode (Active)**. 
Click **Overview** and check the number of file objects, files, directories, or blocks in the HDFS in **Summary**. + + For MRS 3.\ *x* or later: Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **NameNode(Active)**, and click **Overview** to view information like the number of file objects, files, directories, and blocks in HDFS in **Summary** area. + +- DataNode information + + For MRS 1.9.2 or earlier: Log in to MRS Manager and choose **Services** > **HDFS** > **NameNode (Active)**. Click **DataNodes** and check the number of blocks of all DataNodes that report alarms. + + For versions earlier than MRS 3.\ *x*: Log in to the MRS console and choose **Components** > **HDFS** > **NameNode (Active)**. Click **DataNodes** and check the number of blocks of all DataNodes that report alarms. + + For MRS 3.\ *x* or later: Log in to FusionInsight Manager, choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **NameNode(Active)**, and click **DataNodes** to view the number of blocks on all DataNodes that report alarms. + +- Alarm information + + Check whether the alarms whose IDs are 14007, 14008, and 14009 are generated and change the alarm thresholds as required. diff --git a/doc/component-operation-guide/source/using_hdfs/reducing_the_probability_of_abnormal_client_application_operation_when_the_network_is_not_stable.rst b/doc/component-operation-guide/source/using_hdfs/reducing_the_probability_of_abnormal_client_application_operation_when_the_network_is_not_stable.rst new file mode 100644 index 0000000..cbd9a41 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/reducing_the_probability_of_abnormal_client_application_operation_when_the_network_is_not_stable.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0811.html + +.. _mrs_01_0811: + +Reducing the Probability of Abnormal Client Application Operation When the Network Is Not Stable +================================================================================================ + +Scenario +-------- + +Clients probably encounter running errors when the network is not stable. Users can adjust the following parameter values to improve the running efficiency. + +Configuration Description +------------------------- + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** Parameter description + + +--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +============================================+=======================================================================================================================================================================================================+=======================+ + | ha.health-monitor.rpc-timeout.ms | Timeout interval during the NameNode health check performed by ZKFC. Increasing this value can prevent dual active NameNodes and reduce the probability of application running exceptions on clients. | 180,000 | + | | | | + | | Unit: millisecond. 
Value range: 30,000 to 3,600,000 | | + +--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | ipc.client.connect.max.retries.on.timeouts | Number of retry times when the socket connection between a server and a client times out. | 45 | + | | | | + | | Value range: 1 to 256 | | + +--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | ipc.client.connect.timeout | Timeout interval of the socket connection between a client and a server. Increasing the value of this parameter increases the timeout interval for setting up a connection. | 20,000 | + | | | | + | | Unit: millisecond. Value range: 1 to 3,600,000 | | + +--------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/running_the_distcp_command.rst b/doc/component-operation-guide/source/using_hdfs/running_the_distcp_command.rst new file mode 100644 index 0000000..1f3069d --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/running_the_distcp_command.rst @@ -0,0 +1,253 @@ +:original_name: mrs_01_0794.html + +.. _mrs_01_0794: + +Running the DistCp Command +========================== + +Scenario +-------- + +DistCp is a tool used to perform large-amount data replication between clusters or in a cluster. It uses MapReduce tasks to implement distributed copy of a large amount of data. + +Prerequisites +------------- + +- The Yarn client or a client that contains Yarn has been installed. For example, the installation directory is **/opt/client**. +- Service users of each component are created by the system administrator based on service requirements. In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. (Not involved in normal mode) +- To copy data between clusters, you need to enable the inter-cluster data copy function on both clusters. + +Procedure +--------- + +#. Log in to the node where the client is installed. + +#. Run the following command to go to the client installation directory: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, the user group to which the user executing the DistCp command belongs must be **supergroup** and the user run the following command to perform user authentication. In normal mode, user authentication is not required. + + **kinit** *Component service user* + +#. Run the DistCp command. The following provides an example: + + **hadoop distcp hdfs://hacluster/source hdfs://hacluster/target** + +Common Usage of DistCp +---------------------- + +#. The following is an example of the commonest usage of DistCp: + + .. code-block:: + + hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hdfs://cluster1/source hdfs://cluster2/target + + .. 
note:: + + In the preceding command: + + - **-numListstatusThreads 40** specifies that 40 threads are used to build the list of files to be copied. + + - **-update -delete** specifies that the target location is synchronized with the source location and that redundant files, which exist only at the target location, are deleted. To copy files incrementally, remove the **-delete** option. + + - If **-prbugpaxtq** is used together with **-update**, the status information of the copied files is also updated. + + - **hdfs://cluster1/source** indicates the source location, and **hdfs://cluster2/target** indicates the target location. + +#. The following is an example of data copy between clusters: + + .. code-block:: + + hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/bar/foo + + .. note:: + + The network between cluster1 and cluster2 must be reachable, and the two clusters must use the same HDFS version or compatible HDFS versions. + +#. The following is an example of copying multiple source directories: + + .. code-block:: + + hadoop distcp hdfs://cluster1/foo/a \ + hdfs://cluster1/foo/b \ + hdfs://cluster2/bar/foo + + The preceding command is used to copy the folders a and b of cluster1 to the **/bar/foo** directory of cluster2. The effect is equivalent to that of the following command: + + .. code-block:: + + hadoop distcp -f hdfs://cluster1/srclist \ + hdfs://cluster2/bar/foo + + The content of **srclist** is as follows. Before running the DistCp command, upload the **srclist** file to HDFS. + + .. code-block:: + + hdfs://cluster1/foo/a + hdfs://cluster1/foo/b + +#. **-update** is used to copy files that do not exist at the target location or whose content differs from the corresponding files at the target location, and **-overwrite** is used to overwrite existing files at the target location. + + The following example shows the difference between running the command without either option and running it with one of them (**update** or **overwrite**): + + Assume that the file structure at the source location is as follows: + + .. code-block:: + + hdfs://cluster1/source/first/1 + hdfs://cluster1/source/first/2 + hdfs://cluster1/source/second/10 + hdfs://cluster1/source/second/20 + + The command without either option is as follows: + + .. code-block:: + + hadoop distcp hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target + + By default, the preceding command creates the **first** and **second** folders at the target location. Therefore, the copy results are as follows: + + .. code-block:: + + hdfs://cluster2/target/first/1 + hdfs://cluster2/target/first/2 + hdfs://cluster2/target/second/10 + hdfs://cluster2/target/second/20 + + The command with either of the two options (for example, **update**) is as follows: + + .. code-block:: + + hadoop distcp -update hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target + + The preceding command copies only the content of the source directories (not the directories themselves) to the target location. Therefore, the copy results are as follows: + + .. code-block:: + + hdfs://cluster2/target/1 + hdfs://cluster2/target/2 + hdfs://cluster2/target/10 + hdfs://cluster2/target/20 + + .. note:: + + - If files with the same name exist in multiple source locations, the DistCp command fails. + + - If neither **update** nor **overwrite** is used and the file to be copied already exists in the target location, the file will be skipped.
+ - When **update** is used, if the file to be copied already exists in the target location but the file content is different, the file content in the target location is updated. + - When **overwrite** is used, if the file to be copied already exists in the target location, the file in the target location is still overwritten. + +#. The following table describes other command options: + + .. table:: **Table 1** Other command options + + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Option | Description | + +===================================+==============================================================================================================================================================================================================================================================================================================+ + | -p[rbugpcaxtq] | When **-update** is also used, the status information of a copied file is updated even if the content of the copied file is not updated. | + | | | + | | **r**: number of copies | + | | | + | | **b**: size of a block | + | | | + | | **u**: user to which the files belong | + | | | + | | **g**: user group to which the user belongs | + | | | + | | **p**: permission | + | | | + | | **c**: check and type | + | | | + | | **a**: access control | + | | | + | | **t**: timestamp | + | | | + | | **q**: quota information | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -i | Failures ignored during copying | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -log | Path of the specified log | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -v | Additional information in the specified log | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -m | Maximum number of concurrent copy tasks that can be executed at the same time | + 
+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -numListstatusThreads | Number of threads for constituting the list of copied files. This option increases the running speed of DistCp. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -overwrite | File at the target location that is to be overwritten | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -update | A file at the target location is updated if the size and check of a file at the source location are different from those of the file at the target location. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -append | When **-update** is also used, the content of the file at the source location is added to the file at the target location. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -f | Content of the **** file is used as the file list to be copied. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -filters | A local file is specified whose content contains multiple regular expressions. If the file to be copied matches a regular expression, the file is not copied. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -async | The **distcp** command is run asynchronously. 
| + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -atomic {-tmp } | An atomic copy can be performed. You can add a temporary directory during copying. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -bandwidth | The transmission bandwidth of each copy task. Unit: MB/s. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -delete | The files that exist in the target location is deleted but do not exist in the source location. This option is usually used with **-update**, and indicates that files at the source location are synchronized with those at the target location and the redundant files at the target location are deleted. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -diff | The differences between the old and new versions are copied to a file in the old version at the target location. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -skipcrccheck | Whether to skip the cyclic redundancy check (CRC) between the source file and the target file. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -strategy {dynamic|uniformsize} | The policy for copying a task. The default policy is **uniformsize**, that is, each copy task copies the same number of bytes. | + +-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +FAQs of DistCp +-------------- + +#. When you run the DistCp command, if the content of some copied files is large, you are advised to change the timeout period of MapReduce that executes the copy task. 
It can be implemented by specifying the **mapreduce.task.timeout** in the DistCp command. For example, run the following command to change the timeout to 30 minutes: + + .. code-block:: + + hadoop distcp -Dmapreduce.task.timeout=1800000 hdfs://cluster1/source hdfs://cluster2/target + + Or, you can also use **filters** to exclude the large files out of the copy process. The command example is as follows: + + .. code-block:: + + hadoop distcp -filters /opt/client/filterfile hdfs://cluster1/source hdfs://cluster2/target + + In the preceding command, *filterfile* indicates a local file, which contains multiple expressions used to match the path of a file that is not copied. The following is an example: + + .. code-block:: + + .*excludeFile1.* + .*excludeFile2.* + +#. If the DistCp command unexpectedly quits, the error message "java.lang.OutOfMemoryError" is displayed. + + This is because the memory required for running the copy command exceeds the preset memory limit (default value: 128 MB). You can change the memory upper limit of the client by modifying **CLIENT_GC_OPTS** in **\ **/HDFS/component_env**. For example, if you want to set the memory upper limit to 1 GB, refer to the following configuration: + + .. code-block:: + + CLIENT_GC_OPTS="-Xmx1G" + + After the modification, run the following command to make the modification take effect: + + **source** {*Client installation path*}\ **/bigdata_env** + +#. When the dynamic policy is used to run the DistCp command, the command exits unexpectedly and the error message "Too many chunks created with splitRatio" is displayed. + + The cause of this problem is that the value of **distcp.dynamic.max.chunks.tolerable** (default value: 20,000) is less than the value of **distcp.dynamic.split.ratio** (default value: 2) multiplied by the number of Maps. This problem occurs when the number of Maps exceeds 10,000. You can use the **-m** parameter to reduce the number of Maps to less than 10,000. + + .. code-block:: + + hadoop distcp -strategy dynamic -m 9500 hdfs://cluster1/source hdfs://cluster2/target + + Alternatively, you can use the **-D** parameter to set **distcp.dynamic.max.chunks.tolerable** to a large value. + + .. code-block:: + + hadoop distcp -Ddistcp.dynamic.max.chunks.tolerable=30000 -strategy dynamic hdfs://cluster1/source hdfs://cluster2/target diff --git a/doc/component-operation-guide/source/using_hdfs/setting_permissions_on_files_and_directories.rst b/doc/component-operation-guide/source/using_hdfs/setting_permissions_on_files_and_directories.rst new file mode 100644 index 0000000..085c075 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/setting_permissions_on_files_and_directories.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_0807.html + +.. _mrs_01_0807: + +Setting Permissions on Files and Directories +============================================ + +Scenario +-------- + +HDFS allows users to modify the default permissions of files and directories. The default mask provided by the HDFS for creating file and directory permissions is **022**. If you have special requirements for the default permissions, you can set configuration items to change the default permissions. + +Configuration Description +------------------------- + +**Parameter portal:** + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. 
table:: **Table 1** Parameter description + + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ + | Parameter | Description | Default Value | + +===========================+==================================================================================================================================================================================================+=======================+ + | fs.permissions.umask-mode | This **umask** value (user mask) is used when the user creates files and directories in the HDFS on the clients. This parameter is similar to the file permission mask on Linux. | 022 | + | | | | + | | The parameter value can be in octal or in symbolic, for example, **022** (octal, the same as **u=rwx,g=r-x,o=r-x** in symbolic), or **u=rwx,g=rwx,o=** (symbolic, the same as **007** in octal). | | + | | | | + | | .. note:: | | + | | | | + | | The octal mask is opposite to the actual permission value. You are advised to use the symbol notation to make the description clearer. | | + +---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/setting_the_maximum_lifetime_and_renewal_interval_of_a_token.rst b/doc/component-operation-guide/source/using_hdfs/setting_the_maximum_lifetime_and_renewal_interval_of_a_token.rst new file mode 100644 index 0000000..a84dc1b --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/setting_the_maximum_lifetime_and_renewal_interval_of_a_token.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_0808.html + +.. _mrs_01_0808: + +Setting the Maximum Lifetime and Renewal Interval of a Token +============================================================ + +Scenario +-------- + +In security mode, users can flexibly set the maximum token lifetime and token renewal interval in HDFS based on cluster requirements. + +Configuration Description +------------------------- + +**Navigation path for setting parameters:** + +Go to the **All Configurations** page of HDFS and enter a parameter name in the search box by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +.. table:: **Table 1** Parameter description + + +----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ + | Parameter | Description | Default Value | + +==============================================+=========================================================================================================================================================+===============+ + | dfs.namenode.delegation.token.max-lifetime | This parameter is a server parameter. It specifies the maximum lifetime of a token. Unit: milliseconds. 
Value range: 10,000 to 10,000,000,000,000 | 604,800,000 | + +----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ + | dfs.namenode.delegation.token.renew-interval | This parameter is a server parameter. It specifies the maximum lifetime to renew a token. Unit: milliseconds. Value range: 10,000 to 10,000,000,000,000 | 86,400,000 | + +----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ diff --git a/doc/component-operation-guide/source/using_hdfs/using_hdfs_az_mover.rst b/doc/component-operation-guide/source/using_hdfs/using_hdfs_az_mover.rst new file mode 100644 index 0000000..b3e017d --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/using_hdfs_az_mover.rst @@ -0,0 +1,67 @@ +:original_name: mrs_01_2360.html + +.. _mrs_01_2360: + +Using HDFS AZ Mover +=================== + +Scenario +-------- + +AZ Mover is a copy migration tool used to move copies to meet the new AZ policies set on the directory. It can be used to migrate copies from one AZ policy to another. AZ Mover instructs NameNode to move copies based on a new AZ policy. If the NameNode refuses to delete the old copies, the new policy may not be met. For example, the copies are marked as outdated. + +Restrictions +------------ + +- Changing the policy name to **LOCAL_AZ** is the same as that to **ONE_AZ** because the client location cannot be determined when the uploaded file is written. +- Mover cannot determine the AZ status. As a result, the copy may be moved to the abnormal AZ and depends on NameNode for further processing. +- Mover depends on whether the number of DataNodes in each AZ meets the minimum requirement. If the AZ Mover is executed in an AZ with a small number of DataNodes, the result may be different from the expected result. +- Mover only meets the AZ-level policies and does not guarantee to meet the basic block placement policy (BPP). +- Mover does not support the change of replication factors. If the number of copies in the new AZ is different from that in the old AZ, an exception occurs. + +Procedure +--------- + +#. Run the following command to go to the client installation directory. + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, the user must have the read permission on the source directory or file and the write permission on the destination directory, and run the following command to authenticate the user: In normal mode, skip user authentication. + + **kinit** *Component service user* + +#. Create a directory and set an AZ policy. + + Run the following command to create a directory. + + **hdfs dfs -mkdir** <*path*> + + Run the following command to set the AZ policy (**azexpression** indicates the AZ policy): + + **hdfs dfsadmin -setAZExpression** *** * + + Run the following command to view the AZ policy: + + **hdfs dfsadmin -getAZExpression** * + +#. Upload files to the directory. + + **hdfs dfs -put <**\ *localfile*\ **> <**\ *hdfs-path*\ **>** + +#. Delete the old policy from the directory and set a new policy. 
+ + Run the following command to clear the old policy: + + **hdfs dfsadmin -clearAZExpression <**\ *path*\ **>** + + Run the following command to configure a new policy: + + **hdfs dfsadmin -setAZExpression <**\ *path*\ **> <**\ *azexpression*\ **>** + +#. Run the **azmover** command to make the copy distribution meet the new AZ policy. + + **hdfs azmover -p /targetDirecotry** diff --git a/doc/component-operation-guide/source/using_hdfs/using_the_hdfs_client.rst b/doc/component-operation-guide/source/using_hdfs/using_the_hdfs_client.rst new file mode 100644 index 0000000..f2700c8 --- /dev/null +++ b/doc/component-operation-guide/source/using_hdfs/using_the_hdfs_client.rst @@ -0,0 +1,102 @@ +:original_name: mrs_01_1663.html + +.. _mrs_01_1663: + +Using the HDFS Client +===================== + +Scenario +-------- + +This section describes how to use the HDFS client in an O&M scenario or service scenario. + +Prerequisites +------------- + +- The client has been installed. + + For example, the installation directory is **/opt/hadoopclient**. The client directory in the following operations is only an example. Change it to the actual installation directory. + +- Service component users are created by the administrator as required. In security mode, machine-machine users need to download the keytab file. A human-machine user needs to change the password upon the first login. (This operation is not required in normal mode.) + + +Using the HDFS Client +--------------------- + +#. Log in to the node where the client is installed as the client installation user. + +#. Run the following command to go to the client installation directory: + + **cd /opt/hadoopclient** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. If the cluster is in security mode, run the following command to authenticate the user. In normal mode, user authentication is not required. + + **kinit** *Component service user* + +#. Run the HDFS Shell command. Example: + + **hdfs dfs -ls /** + +Common HDFS Client Commands +--------------------------- + +The following table lists common HDFS client commands. + +For more commands, see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CommandsManual.html#User_Commands. + +.. table:: **Table 1** Common HDFS client commands + + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | Command | Description | Example | + +================================================================================+=============================================================+=========================================================================================+ + | **hdfs dfs -mkdir** *Folder name* | Used to create a folder. | **hdfs dfs -mkdir /tmp/mydir** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -ls** *Folder name* | Used to view a folder. 
| **hdfs dfs -ls /tmp** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -put** *Local file on the client node* *Specified HDFS path* | Used to upload a local file to a specified HDFS path. | **hdfs dfs -put /opt/test.txt /tmp** | + | | | | + | | | Upload the **/opt/test.txt** file on the client node to the **/tmp** directory of HDFS. | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -get** *Specified file on HDFS* *Specified path on the client node* | Used to download the HDFS file to the specified local path. | **hdfs dfs -get /tmp/test.txt /opt/** | + | | | | + | | | Download the **/tmp/test.txt** file on HDFS to the **/opt** path on the client node. | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -rm -r -f** *Specified folder on HDFS* | Used to delete a folder. | **hdfs dfs -rm -r -f /tmp/mydir** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + | **hdfs dfs -chmod** *Permission parameter File directory* | Used to configure the HDFS directory permission for a user. | **hdfs dfs -chmod 700 /tmp/test** | + +--------------------------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------+ + +Client-related FAQs +------------------- + +#. What do I do when the HDFS client exits abnormally and error message "java.lang.OutOfMemoryError" is displayed after the HDFS client command is running? + + This problem occurs because the memory required for running the HDFS client exceeds the preset upper limit (128 MB by default). You can change the memory upper limit of the client by modifying **CLIENT_GC_OPTS** in **\ **/HDFS/component_env**. For example, if you want to set the upper limit to 1 GB, run the following command: + + .. code-block:: + + CLIENT_GC_OPTS="-Xmx1G" + + After the modification, run the following command to make the modification take effect: + + **source** <*Client installation path*>/**/bigdata_env** + +#. How do I set the log level when the HDFS client is running? + + By default, the logs generated during the running of the HDFS client are printed to the console. The default log level is INFO. To enable the DEBUG log level for fault locating, run the following command to export an environment variable: + + **export HADOOP_ROOT_LOGGER=DEBUG,console** + + Then run the HDFS Shell command to generate the DEBUG logs. + + If you want to print INFO logs again, run the following command: + + **export HADOOP_ROOT_LOGGER=INFO,console** + +#. How do I delete HDFS files permanently? + + HDFS provides a recycle bin mechanism. Typically, after an HDFS file is deleted, the file is moved to the recycle bin of HDFS. 
If the file is no longer needed and the storage space needs to be released, clear the corresponding recycle bin directory, for example, **hdfs://hacluster/user/xxx/.Trash/Current/**\ *xxx*. diff --git a/doc/component-operation-guide/source/using_hive/access_control_of_a_dynamic_table_view_on_hive.rst b/doc/component-operation-guide/source/using_hive/access_control_of_a_dynamic_table_view_on_hive.rst new file mode 100644 index 0000000..e051b23 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/access_control_of_a_dynamic_table_view_on_hive.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0959.html + +.. _mrs_01_0959: + +Access Control of a Dynamic Table View on Hive +============================================== + +Scenario +-------- + +This section describes how to create a view on Hive when MRS is configured in security mode, authorize access permissions to different users, and specify that different users access different data. + +In the view, Hive can obtain the built-in function **current_user()** of the users who submit tasks on the client and filter the users. This way, authorized users can only access specific data in the view. + +.. note:: + + In normal mode, the **current_user()** function cannot distinguish users who submit tasks on the client. Therefore, the access control function takes effect only for Hive in security mode. + + If the **current_user()** function is used in the actual service logic, the possible risks must be fully evaluated during the conversion between the security mode and normal mode. + +Operation Example +----------------- + +- If the current_user function is not used, different views need to be created for different users to access different data. + + - Authorize the view **v1** permission to user **hiveuser1**. The user **hiveuser1** can access data with **type** set to **hiveuser1** in **table1**. + + **create view v1 as select \* from table1 where type='hiveuser1'** + + - Authorize the view **v2** permission to user **hiveuser2**. The user **hiveuser2** can access data with **type** set to **hiveuser2** in **table1**. + + **create view v2 as select \* from table1 where type='hiveuser2'** + +- If the current_user function is used, only one view needs to be created. + + Authorize the view **v** permission to users **hiveuser1** and **hiveuser2**. When user **hiveuser1** queries view **v**, the current_user() function is automatically converted to **hiveuser1**. When user **hiveuser2** queries view **v**, the **current_user()** function is automatically converted to **hiveuser2**. + + **create view v as select \* from table1 where type=current_user()** diff --git a/doc/component-operation-guide/source/using_hive/authorizing_over_32_roles_in_hive.rst b/doc/component-operation-guide/source/using_hive/authorizing_over_32_roles_in_hive.rst new file mode 100644 index 0000000..a972d9d --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/authorizing_over_32_roles_in_hive.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0972.html + +.. _mrs_01_0972: + +Authorizing Over 32 Roles in Hive +================================= + +Scenario +-------- + +This function applies to Hive. + +The number of OS user groups is limited, and the number of roles that can be created in Hive cannot exceed 32. After this function is enabled, more than 32 roles can be created in Hive. + +.. 
note:: + + - After this function is enabled and the table or database is authorized, roles that have the same permission on the table or database will be combined using vertical bars (|). When the ACL permission is queried, the combined result is displayed, which is different from that before the function is enabled. This operation is irreversible. Determine whether to make adjustment based on the actual application scenario. + - MRS 3.\ *x* and later versions support Ranger. If the current component uses Ranger for permission control, you need to configure related policies based on Ranger for permission management. For details, see :ref:`Adding a Ranger Access Permission Policy for Hive `. + - After this function is enabled, a maximum of 512 roles (including **owner**) are supported by default. The number is controlled by the user-defined parameter **hive.supports.roles.max** of MetaStore. You can change the value based on the actual application scenario. + +Procedure +--------- + +#. The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Choose **MetaStore(Role)** > **Customization**, add a customized parameter to the **hivemetastore-site.xml** parameter file, set **Name** to **hive.supports.over.32.roles**, and set **Value** to **true**. Restart all Hive instances after the modification. +#. Choose **HiveServer(Role)** > **Customization**, add a customized parameter to the **hive-site.xml** parameter file, set **Name** to **hive.supports.over.32.roles**, and set **Value** to **true**. Restart all Hive instances after the modification. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/description_of_hive_table_location_either_be_an_obs_or_hdfs_path.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/description_of_hive_table_location_either_be_an_obs_or_hdfs_path.rst new file mode 100644 index 0000000..430c71e --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/description_of_hive_table_location_either_be_an_obs_or_hdfs_path.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_1763.html + +.. _mrs_01_1763: + +Description of Hive Table Location (Either Be an OBS or HDFS Path) +================================================================== + +Question +-------- + +Can Hive tables be stored in OBS or HDFS? + +Answer +------ + +#. The location of a common Hive table stored on OBS can be set to an HDFS path. +#. In the same Hive service, you can create tables stored in OBS and HDFS, respectively. +#. For a Hive partitioned table stored on OBS, the location of the partition cannot be set to an HDFS path. 
(For a partitioned table stored on HDFS, the location of the partition cannot be changed to OBS.) diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/error_reported_when_the_where_condition_is_used_to_query_tables_with_excessive_partitions_in_fusioninsight_hive.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/error_reported_when_the_where_condition_is_used_to_query_tables_with_excessive_partitions_in_fusioninsight_hive.rst new file mode 100644 index 0000000..06aa74a --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/error_reported_when_the_where_condition_is_used_to_query_tables_with_excessive_partitions_in_fusioninsight_hive.rst @@ -0,0 +1,25 @@ +:original_name: mrs_01_1761.html + +.. _mrs_01_1761: + +Error Reported When the WHERE Condition Is Used to Query Tables with Excessive Partitions in FusionInsight Hive +=============================================================================================================== + +Question +-------- + +When a table with more than 32,000 partitions is created in Hive, an exception occurs during the query with the WHERE partition. In addition, the exception information printed in **metastore.log** contains the following information: + +.. code-block:: + + Caused by: java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 32970 + at org.postgresql.core.PGStream.SendInteger2(PGStream.java:199) + at org.postgresql.core.v3.QueryExecutorImpl.sendParse(QueryExecutorImpl.java:1330) + at org.postgresql.core.v3.QueryExecutorImpl.sendOneQuery(QueryExecutorImpl.java:1601) + at org.postgresql.core.v3.QueryExecutorImpl.sendParse(QueryExecutorImpl.java:1191) + at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:346) + +Answer +------ + +During a query with partition conditions, HiveServer optimizes the partitions to avoid full table scanning. All partitions whose metadata meets the conditions need to be queried. However, the **sendOneQuery** interface provided by GaussDB limits the parameter value to **32767** in the **sendParse** method. If the number of partition conditions exceeds **32767**, an exception occurs. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/hive_configuration_problems.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/hive_configuration_problems.rst new file mode 100644 index 0000000..7fb349c --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/hive_configuration_problems.rst @@ -0,0 +1,60 @@ +:original_name: mrs_01_24117.html + +.. _mrs_01_24117: + +Hive Configuration Problems +=========================== + +- The error message "java.lang.OutOfMemoryError: Java heap space." is displayed during Hive SQL execution. + + Solution: + + - For MapReduce tasks, increase the values of the following parameters: + + **set mapreduce.map.memory.mb=8192;** + + **set mapreduce.map.java.opts=-Xmx6554M;** + + **set mapreduce.reduce.memory.mb=8192;** + + **set mapreduce.reduce.java.opts=-Xmx6554M;** + + - For Tez tasks, increase the value of the following parameter: + + **set hive.tez.container.size=8192;** + +- After a column name is changed to a new one using the Hive SQL **as** statement, the error message "Invalid table alias or column reference 'xxx'." is displayed when the original column name is used for compilation. + + Solution: Run the **set hive.cbo.enable=true;** statement. 
+ +- The error message "Unsupported SubQuery Expression 'xxx': Only SubQuery expressions that are top level conjuncts are allowed." is displayed during Hive SQL subquery compilation. + +  Solution: Run the **set hive.cbo.enable=true;** statement. + +- The error message "CalciteSubquerySemanticException [Error 10249]: Unsupported SubQuery Expression Currently SubQuery expressions are only allowed as Where and Having Clause predicates." is displayed during Hive SQL subquery compilation. + +  Solution: Run the **set hive.cbo.enable=true;** statement. + +- The error message "Error running query: java.lang.AssertionError: Cannot add expression of different type to set." is displayed during Hive SQL compilation. + +  Solution: Run the **set hive.cbo.enable=false;** statement. + +- The error message "java.lang.NullPointerException at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFComputeStats$GenericUDAFNumericStatsEvaluator.init." is displayed during Hive SQL execution. + +  Solution: Run the **set hive.map.aggr=false;** statement. + +- When **hive.auto.convert.join** is set to **true** (enabled by default) and **hive.optimize.skewjoin** is set to **true**, the error message "ClassCastException org.apache.hadoop.hive.ql.plan.ConditionalWork cannot be cast to org.apache.hadoop.hive.ql.plan.MapredWork" is displayed. + +  Solution: Run the **set hive.optimize.skewjoin=false;** statement. + +- When **hive.auto.convert.join** is set to **true** (enabled by default), **hive.optimize.skewjoin** is set to **true**, and **hive.exec.parallel** is set to **true**, the error message "java.io.FileNotFoundException: File does not exist:xxx/reduce.xml" is displayed. + +  Solution: + +  - Method 1: Switch the execution engine to Tez. For details, see :ref:`Switching the Hive Execution Engine to Tez `. + - Method 2: Run the **set hive.exec.parallel=false;** statement. + - Method 3: Run the **set hive.auto.convert.join=false;** statement. + +- The error message "NullPointerException at org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.mergeJoinComputeKeys" is displayed when Hive on Tez executes a bucket map join. + +  Solution: Run the **set tez.am.container.reuse.enabled=false;** statement. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_delete_udfs_on_multiple_hiveservers_at_the_same_time.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_delete_udfs_on_multiple_hiveservers_at_the_same_time.rst new file mode 100644 index 0000000..da038cd --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_delete_udfs_on_multiple_hiveservers_at_the_same_time.rst @@ -0,0 +1,72 @@ +:original_name: mrs_01_1753.html + +.. _mrs_01_1753: + +How Do I Delete UDFs on Multiple HiveServers at the Same Time? +============================================================== + +Question +-------- + +How can I delete permanent user-defined functions (UDFs) on multiple HiveServers at the same time? + +Answer +------ + +Multiple HiveServers share one MetaStore database. Therefore, there is a delay in the data synchronization between the MetaStore database and the HiveServer memory. If a permanent UDF is deleted from one HiveServer, the operation result cannot be synchronized to the other HiveServers promptly. + +In this case, you need to log in to the Hive client to connect to each HiveServer and delete permanent UDFs on the HiveServers one by one. The operations are as follows: + +#. 
Log in to the node where the Hive client is installed as the Hive client installation user. + +#. Run the following command to go to the client installation directory: + + **cd** *Client installation directory* + + For example, if the client installation directory is **/opt/client**, run the following command: + + **cd /opt/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Run the following command to authenticate the user: + + **kinit** *Hive service user* + + .. note:: + + The login user must have the Hive admin rights. + +#. .. _mrs_01_1753__l7ef35cc9f4de4ef9966a1cda923d47e5: + + Run the following command to connect to the specified HiveServer: + + **beeline -u "jdbc:hive2://**\ *10.39.151.74*\ **:**\ *21066*\ **/default;sasl.qop=auth-conf;auth=KERBEROS;principal=**\ *hive*/*hadoop.@*" + + .. note:: + + - *10.39.151.74* is the IP address of the node where the HiveServer is located. + - *21066* is the port number of the HiveServer. The HiveServer port number ranges from 21066 to 21070 by default. Use the actual port number. + - *hive* is the username. For example, if the Hive1 instance is used, the username is **hive1**. + - You can log in to FusionInsight Manager, choose **System** > **Permission** > **Domain and Mutual Trust**, and view the value of **Local Domain**, which is the current system domain name. + - **hive/hadoop.\ **** is the username. All letters in the system domain name contained in the username are lowercase letters. + +#. Run the following command to enable the Hive admin rights: + + **set role admin;** + +#. Run the following command to delete the permanent UDF: + + **drop function** *function_name*\ **;** + + .. note:: + + - *function_name* indicates the name of the permanent function. + - If the permanent UDF is created in Spark, the permanent UDF needs to be deleted from Spark and then from HiveServer by running the preceding command. + +#. Check whether the permanent UDFs are deleted from all HiveServers. + + - If yes, no further action is required. + - If no, go to :ref:`5 `. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_disable_the_logging_function_of_hive.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_disable_the_logging_function_of_hive.rst new file mode 100644 index 0000000..cb476d7 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_disable_the_logging_function_of_hive.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_24482.html + +.. _mrs_01_24482: + +How Do I Disable the Logging Function of Hive? +============================================== + +Question +-------- + +How do I disable the logging function of Hive? + +Answer +------ + +#. Log in to the node where the client is installed as user **root**. + +#. Run the following command to switch to the client installation directory, for example, **/opt/Bigdata/client**: + + **cd** **/opt/Bigdata/client** + +#. Run the following command to configure environment variables: + + **source bigdata_env** + +#. Log in to the Hive client based on the cluster authentication mode. 
+ + - In security mode, run the following command to complete user authentication and log in to the Hive client: + + **kinit** *Component service user* + + **beeline** + + - In normal mode, run the following command to log in to the Hive client: + + - Run the following command to log in to the Hive client as the component service user: + + **beeline -n** *component service user* + + - If no component service user is specified, the current OS user is used to log in to the Hive client. + + **beeline** + +#. Run the following command to disable the logging function: + + **set hive.server2.logging.operation.enabled=false;** + +#. Run the following command to check whether the logging function is disabled. If the following information is displayed, the logging function is disabled successfully. + + **set hive.server2.logging.operation.enabled;** + + |image1| + +.. |image1| image:: /_static/images/en-us_image_0000001296250116.png diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_forcibly_stop_mapreduce_jobs_executed_by_hive.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_forcibly_stop_mapreduce_jobs_executed_by_hive.rst new file mode 100644 index 0000000..75cc637 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_forcibly_stop_mapreduce_jobs_executed_by_hive.rst @@ -0,0 +1,19 @@ +:original_name: mrs_01_1756.html + +.. _mrs_01_1756: + +How Do I Forcibly Stop MapReduce Jobs Executed by Hive? +======================================================= + +Question +-------- + +How do I stop a MapReduce task manually if the task is suspended for a long time? + +Answer +------ + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Yarn**. +#. On the left pane, click **ResourceManager(Host name, Active)**, and log in to Yarn. +#. Click the button corresponding to the task ID. On the task page that is displayed, click **Kill Application** in the upper left corner and click **OK** in the displayed dialog box to stop the task. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_monitor_the_hive_table_size.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_monitor_the_hive_table_size.rst new file mode 100644 index 0000000..e406424 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_monitor_the_hive_table_size.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_1758.html + +.. _mrs_01_1758: + +How Do I Monitor the Hive Table Size? +===================================== + +Question +-------- + +How do I monitor the Hive table size? + +Answer +------ + +The HDFS refined monitoring function allows you to monitor the size of a specified table directory. + +Prerequisites +------------- + +- The Hive and HDFS components are running properly. +- The HDFS refined monitoring function is normal. + +Procedure +--------- + +#. Log in to FusionInsight Manager. + +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **HDFS** > **Resource**. + +#. Click the first icon in the upper left corner of **Resource Usage (by Directory)**, as shown in the following figure. + + |image1| + +4. In the displayed sub page for configuring space monitoring, click **Add**. +5. 
In the displayed **Add a Monitoring Directory** dialog box, set **Name** to the name or the user-defined alias of the table to be monitored and **Path** to the path of the monitored table. Click **OK**. In the monitoring result, the horizontal coordinate indicates the time, and the vertical coordinate indicates the size of the monitored directory. + +.. |image1| image:: /_static/images/en-us_image_0000001296090496.jpg diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_prevent_key_directories_from_data_loss_caused_by_misoperations_of_the_insert_overwrite_statement.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_prevent_key_directories_from_data_loss_caused_by_misoperations_of_the_insert_overwrite_statement.rst new file mode 100644 index 0000000..d451c4f --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_do_i_prevent_key_directories_from_data_loss_caused_by_misoperations_of_the_insert_overwrite_statement.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_1759.html + +.. _mrs_01_1759: + +How Do I Prevent Key Directories from Data Loss Caused by Misoperations of the **insert overwrite** Statement? +============================================================================================================== + +Question +-------- + +How do I prevent key directories from data loss caused by misoperations of the **insert overwrite** statement? + +Answer +------ + +During monitoring of key Hive databases, tables, or directories, to prevent data loss caused by misoperations of the **insert overwrite** statement, configure **hive.local.dir.confblacklist** in Hive to protect directories. + +This configuration item has been configured for directories such as **/opt/** and **/user/hive/warehouse** by default. + +Prerequisites +------------- + +The Hive and HDFS components are running properly. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**, and search for the **hive.local.dir.confblacklist** configuration item. + +3. Add paths of databases, tables, or directories to be protected in the parameter value. +4. Click **Save** to save the settings. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_to_perform_operations_on_local_files_with_hive_user-defined_functions.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_to_perform_operations_on_local_files_with_hive_user-defined_functions.rst new file mode 100644 index 0000000..015f9fb --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/how_to_perform_operations_on_local_files_with_hive_user-defined_functions.rst @@ -0,0 +1,31 @@ +:original_name: mrs_01_1755.html + +.. _mrs_01_1755: + +How to Perform Operations on Local Files with Hive User-Defined Functions +========================================================================= + +Question +-------- + +How to perform operations on local files (such as reading the content of a file) with Hive user-defined functions? + +Answer +------ + +By default, you can perform operations on local files with their relative paths in UDF. The following are sample codes: + +.. 
code-block:: + + public String evaluate(String text) { + // some logic + File file = new File("foo.txt"); + // some logic + // do return here + } + +In Hive, upload the file **foo.txt** used in UDF to HDFS, such as **hdfs://hacluster/tmp/foo.txt**. You can perform operations on the **foo.txt** file by creating UDF with the following sentences: + +**create function testFunc as 'some.class' using jar 'hdfs://hacluster/somejar.jar', file 'hdfs://hacluster/tmp/foo.txt';** + +In abnormal cases, if the value of **hive.fetch.task.conversion** is **more**, you can perform operations on local files in UDF by using absolute path instead of relative path. In addition, you must ensure that the file exists on all HiveServer nodes and NodeManager nodes and **omm** user have corresponding operation rights. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/index.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/index.rst new file mode 100644 index 0000000..9448c7d --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/index.rst @@ -0,0 +1,46 @@ +:original_name: mrs_01_1752.html + +.. _mrs_01_1752: + +Common Issues About Hive +======================== + +- :ref:`How Do I Delete UDFs on Multiple HiveServers at the Same Time? ` +- :ref:`Why Cannot the DROP operation Be Performed on a Backed-up Hive Table? ` +- :ref:`How to Perform Operations on Local Files with Hive User-Defined Functions ` +- :ref:`How Do I Forcibly Stop MapReduce Jobs Executed by Hive? ` +- :ref:`How Do I Monitor the Hive Table Size? ` +- :ref:`How Do I Prevent Key Directories from Data Loss Caused by Misoperations of the insert overwrite Statement? ` +- :ref:`Why Is Hive on Spark Task Freezing When HBase Is Not Installed? ` +- :ref:`Error Reported When the WHERE Condition Is Used to Query Tables with Excessive Partitions in FusionInsight Hive ` +- :ref:`Why Cannot I Connect to HiveServer When I Use IBM JDK to Access the Beeline Client? ` +- :ref:`Description of Hive Table Location (Either Be an OBS or HDFS Path) ` +- :ref:`Why Cannot Data Be Queried After the MapReduce Engine Is Switched After the Tez Engine Is Used to Execute Union-related Statements? ` +- :ref:`Why Does Hive Not Support Concurrent Data Writing to the Same Table or Partition? ` +- :ref:`Why Does Hive Not Support Vectorized Query? ` +- :ref:`Why Does Metadata Still Exist When the HDFS Data Directory of the Hive Table Is Deleted by Mistake? ` +- :ref:`How Do I Disable the Logging Function of Hive? ` +- :ref:`Why Hive Tables in the OBS Directory Fail to Be Deleted? ` +- :ref:`Hive Configuration Problems ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + how_do_i_delete_udfs_on_multiple_hiveservers_at_the_same_time + why_cannot_the_drop_operation_be_performed_on_a_backed-up_hive_table + how_to_perform_operations_on_local_files_with_hive_user-defined_functions + how_do_i_forcibly_stop_mapreduce_jobs_executed_by_hive + how_do_i_monitor_the_hive_table_size + how_do_i_prevent_key_directories_from_data_loss_caused_by_misoperations_of_the_insert_overwrite_statement + why_is_hive_on_spark_task_freezing_when_hbase_is_not_installed + error_reported_when_the_where_condition_is_used_to_query_tables_with_excessive_partitions_in_fusioninsight_hive + why_cannot_i_connect_to_hiveserver_when_i_use_ibm_jdk_to_access_the_beeline_client + description_of_hive_table_location_either_be_an_obs_or_hdfs_path + why_cannot_data_be_queried_after_the_mapreduce_engine_is_switched_after_the_tez_engine_is_used_to_execute_union-related_statements + why_does_hive_not_support_concurrent_data_writing_to_the_same_table_or_partition + why_does_hive_not_support_vectorized_query + why_does_metadata_still_exist_when_the_hdfs_data_directory_of_the_hive_table_is_deleted_by_mistake + how_do_i_disable_the_logging_function_of_hive + why_hive_tables_in_the_obs_directory_fail_to_be_deleted + hive_configuration_problems diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_data_be_queried_after_the_mapreduce_engine_is_switched_after_the_tez_engine_is_used_to_execute_union-related_statements.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_data_be_queried_after_the_mapreduce_engine_is_switched_after_the_tez_engine_is_used_to_execute_union-related_statements.rst new file mode 100644 index 0000000..364ae22 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_data_be_queried_after_the_mapreduce_engine_is_switched_after_the_tez_engine_is_used_to_execute_union-related_statements.rst @@ -0,0 +1,18 @@ +:original_name: mrs_01_2309.html + +.. _mrs_01_2309: + +Why Cannot Data Be Queried After the MapReduce Engine Is Switched After the Tez Engine Is Used to Execute Union-related Statements? +=================================================================================================================================== + +Question +-------- + +Hive uses the Tez engine to execute union-related statements to write data. After Hive is switched to the MapReduce engine for query, no data is found. + +Answer +------ + +When Hive uses the Tez engine to execute the union-related statement, the generated output file is stored in the **HIVE_UNION_SUBDIR** directory. After Hive is switched back to the MapReduce engine, files in the directory are not read by default. Therefore, data in the **HIVE_UNION_SUBDIR** directory is not read. + +In this case, you can set **mapreduce.input.fileinputformat.input.dir.recursive** to **true** to enable union optimization and determine whether to read data in the directory. 
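+
+For example, a minimal session-level sketch on the Beeline client might look as follows. The table name **union_demo_tbl** is only a hypothetical placeholder, and the **hive.execution.engine** setting is shown only to illustrate that the query runs on the MapReduce engine:
+
+.. code-block::
+
+   set hive.execution.engine=mr;
+   set mapreduce.input.fileinputformat.input.dir.recursive=true;
+   select count(*) from union_demo_tbl;
+
+If the setting needs to persist beyond the current session, it can instead be added to the service configuration, for example as a custom MapReduce parameter.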
diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_i_connect_to_hiveserver_when_i_use_ibm_jdk_to_access_the_beeline_client.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_i_connect_to_hiveserver_when_i_use_ibm_jdk_to_access_the_beeline_client.rst new file mode 100644 index 0000000..c428d83 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_i_connect_to_hiveserver_when_i_use_ibm_jdk_to_access_the_beeline_client.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1762.html + +.. _mrs_01_1762: + +Why Cannot I Connect to HiveServer When I Use IBM JDK to Access the Beeline Client? +=================================================================================== + +Scenario +-------- + +When users check the JDK version used by the client, if the JDK version is IBM JDK, the Beeline client needs to be reconstructed. Otherwise, the client will fail to connect to HiveServer. + +Procedure +--------- + +#. Log in to FusionInsight Manager and choose **System** > **Permission** > **User**. In the **Operation** column of the target user, choose **More** > **Download Authentication Credential**, select the cluster information, and click **OK** to download the keytab file. + +#. Decompress the keytab file and use WinSCP to upload the decompressed **user.keytab** file to the Hive client installation directory on the node to be operated, for example, **/opt/client**. + +#. Run the following command to open the **Hive/component_env** configuration file in the Hive client directory: + + **vi** *Hive client installation directory*\ **/Hive/component_env** + + Add the following content to the end of the line where **export CLIENT_HIVE_URI** is located: + + .. code-block:: + + \; user.principal=Username @HADOOP.COM\;user.keytab=user.keytab file path/user.keytab diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_the_drop_operation_be_performed_on_a_backed-up_hive_table.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_the_drop_operation_be_performed_on_a_backed-up_hive_table.rst new file mode 100644 index 0000000..eb45b33 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_cannot_the_drop_operation_be_performed_on_a_backed-up_hive_table.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1754.html + +.. _mrs_01_1754: + +Why Cannot the DROP operation Be Performed on a Backed-up Hive Table? +===================================================================== + +Question +-------- + +Why cannot the **DROP** operation be performed for a backed up Hive table? + +Answer +------ + +Snapshots have been created for an HDFS directory mapping to the backed up Hive table, so the HDFS directory cannot be deleted. As a result, the Hive table cannot be deleted. + +When a Hive table is being backed up, snapshots are created for the HDFS directory mapping to the table. The snapshot mechanism of HDFS has the following limitation: If snapshots have been created for an HDFS directory, the directory cannot be deleted or renamed unless the snapshots are deleted. When the **DROP** operation is performed for a Hive table (except the EXTERNAL table), the system attempts to delete the HDFS directory mapping to the table. If the directory fails to be deleted, the system displays a message indicating that the table fails to be deleted. 
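+
+For reference, a quick way to confirm that snapshots are what blocks the deletion is to list them on the HDFS directory of the table. The warehouse path below is only a hypothetical example:
+
+.. code-block::
+
+   hdfs lsSnapshottableDir
+   hdfs dfs -ls /user/hive/warehouse/testdb.db/test_table/.snapshot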
+ +If you need to delete this table, manually delete all backup tasks related to this table. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_concurrent_data_writing_to_the_same_table_or_partition.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_concurrent_data_writing_to_the_same_table_or_partition.rst new file mode 100644 index 0000000..edf4a92 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_concurrent_data_writing_to_the_same_table_or_partition.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_2310.html + +.. _mrs_01_2310: + +Why Does Hive Not Support Concurrent Data Writing to the Same Table or Partition? +================================================================================= + +Question +-------- + +Why Does Data Inconsistency Occur When Data Is Concurrently Written to a Hive Table Through an API? + +Answer +------ + +Hive does not support concurrent data insertion for the same table or partition. As a result, multiple tasks perform operations on the same temporary data directory, and one task moves the data of another task, causing task data exception. The service logic is modified so that data is inserted to the same table or partition in single thread mode. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_vectorized_query.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_vectorized_query.rst new file mode 100644 index 0000000..a6afdd5 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_hive_not_support_vectorized_query.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_2325.html + +.. _mrs_01_2325: + +Why Does Hive Not Support Vectorized Query? +=========================================== + +Question +-------- + +When the vectorized parameter **hive.vectorized.execution.enabled** is set to **true**, why do some null pointers or type conversion exceptions occur occasionally when Hive on Tez/MapReduce/Spark is executed? + +Answer +------ + +Currently, Hive does not support vectorized execution. Many community issues are introduced during vectorized execution and are not resolved stably. The default value of **hive.vectorized.execution.enabled** is **false**. You are advised not to set this parameter to **true**. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_metadata_still_exist_when_the_hdfs_data_directory_of_the_hive_table_is_deleted_by_mistake.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_metadata_still_exist_when_the_hdfs_data_directory_of_the_hive_table_is_deleted_by_mistake.rst new file mode 100644 index 0000000..828cfac --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_does_metadata_still_exist_when_the_hdfs_data_directory_of_the_hive_table_is_deleted_by_mistake.rst @@ -0,0 +1,26 @@ +:original_name: mrs_01_2343.html + +.. _mrs_01_2343: + +Why Does Metadata Still Exist When the HDFS Data Directory of the Hive Table Is Deleted by Mistake? +=================================================================================================== + +Question +-------- + +The HDFS data directory of the Hive table is deleted by mistake, but the metadata still exists. 
As a result, an error is reported during task execution. + +Answer +------ + +This is an exception caused by a misoperation. You need to manually delete the metadata of the corresponding table and try again. + +Example: + +Run the following command to go to the console: + +**source ${BIGDATA_HOME}/FusionInsight_BASE\_8.1.0.1/install/FusionInsight-dbservice-2.7.0/.dbservice_profile** + +**gsql -p 20051 -U hive -d hivemeta -W HiveUser@** + +Run the **delete from tbls where tbl_id='xxx';** command. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_hive_tables_in_the_obs_directory_fail_to_be_deleted.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_hive_tables_in_the_obs_directory_fail_to_be_deleted.rst new file mode 100644 index 0000000..c0326a9 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_hive_tables_in_the_obs_directory_fail_to_be_deleted.rst @@ -0,0 +1,16 @@ +:original_name: mrs_01_24486.html + +.. _mrs_01_24486: + +Why Hive Tables in the OBS Directory Fail to Be Deleted? +======================================================== + +Question +-------- + +When fine-grained permissions are configured for multiple MRS users to access OBS and the permission for deleting Hive tables in the OBS directory is added to the custom configuration of Hive, tables deleted on the Hive client still exist in the OBS directory. + +Answer +------ + +You do not have the permission to delete directories on OBS. As a result, Hive tables cannot be deleted. In this case, modify the custom IAM policy of the agency and configure Hive with the permission for deleting tables in the OBS directory. diff --git a/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_is_hive_on_spark_task_freezing_when_hbase_is_not_installed.rst b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_is_hive_on_spark_task_freezing_when_hbase_is_not_installed.rst new file mode 100644 index 0000000..7881fe4 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/common_issues_about_hive/why_is_hive_on_spark_task_freezing_when_hbase_is_not_installed.rst @@ -0,0 +1,28 @@ +:original_name: mrs_01_1760.html + +.. _mrs_01_1760: + +Why Is Hive on Spark Task Freezing When HBase Is Not Installed? +=============================================================== + +Scenario +-------- + +This function applies to Hive. + +Perform the following operations to configure parameters so that Hive on Spark tasks do not freeze when they are executed in an environment where HBase is not installed. + +.. note:: + +   The Spark kernel version of Hive on Spark tasks has been upgraded to Spark2x. Hive on Spark tasks can be executed even if Spark2x is not installed. If HBase is not installed, when Spark tasks are executed, the system attempts to connect to ZooKeeper to access HBase until a timeout occurs by default. As a result, task freezing occurs. + +   If HBase is not installed, perform the following operations to execute Hive on Spark tasks. If HBase is upgraded from an earlier version, you do not need to configure parameters after the upgrade. + +Procedure +--------- + +#. Log in to FusionInsight Manager. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. +#. Choose **HiveServer(Role)** > **Customization**. 
Add a customized parameter to the **spark-defaults.conf** parameter file. Set **Name** to **spark.security.credentials.hbase.enabled**, and set **Value** to **false**. +#. Click **Save**. In the dialog box that is displayed, click **OK**. +#. Choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Instance**, select all Hive instances, choose **More** > **Restart Instance**, enter the password, and click **OK**. diff --git a/doc/component-operation-guide/source/using_hive/configuring_hive_on_hbase_in_across_clusters_with_mutual_trust_enabled.rst b/doc/component-operation-guide/source/using_hive/configuring_hive_on_hbase_in_across_clusters_with_mutual_trust_enabled.rst new file mode 100644 index 0000000..1c92b35 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/configuring_hive_on_hbase_in_across_clusters_with_mutual_trust_enabled.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_24293.html + +.. _mrs_01_24293: + +Configuring Hive on HBase in Across Clusters with Mutual Trust Enabled +====================================================================== + +For mutually trusted Hive and HBase clusters with Kerberos authentication enabled, you can access the HBase cluster and synchronize its key configurations to HiveServer of the Hive cluster. + +Prerequisites +------------- + +The mutual trust relationship has been configured between the two security clusters with Kerberos authentication enabled. + +Procedure for Configuring Hive on HBase Across Clusters +------------------------------------------------------- + +#. Download the HBase configuration file and decompress it. + + a. Log in to FusionInsight Manager of the target HBase cluster and choose **Cluster** > **Services** > **HBase**. + b. Choose **More** > **Download Client**. + c. Download the HBase configuration file and choose **Configuration Files only** for **Select Client Type**. + +#. Log in to FusionInsight Manager of the source Hive cluster. + +#. Choose **Cluster** > **Services** > **Hive** and click the **Configurations** tab and then **All Configurations**. On the displayed page, add the following parameters to the **hive-site.xml** configuration file of the HiveServer role. + + Search for the following parameters in the **hbase-site.xml** configuration file of the downloaded HBase client and add them to HiveServer: + + - hbase.security.authentication + - hbase.security.authorization + - hbase.zookeeper.property.clientPort + - hbase.zookeeper.quorum (The domain name needs to be converted into an IP address.) + - hbase.regionserver.kerberos.principal + - hbase.master.kerberos.principal + +#. Save the configurations and restart Hive. diff --git a/doc/component-operation-guide/source/using_hive/configuring_hive_parameters.rst b/doc/component-operation-guide/source/using_hive/configuring_hive_parameters.rst new file mode 100644 index 0000000..2988542 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/configuring_hive_parameters.rst @@ -0,0 +1,51 @@ +:original_name: mrs_01_0582.html + +.. _mrs_01_0582: + +Configuring Hive Parameters +=========================== + +Navigation Path +--------------- + +Go to the Hive configurations page by referring to :ref:`Modifying Cluster Service Configuration Parameters `. + +Parameter Description +--------------------- + +.. 
table:: **Table 1** Hive parameter description + + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | Parameter | Description | Default Value | + +========================================+======================================================================================================================================================================================================================================================================================================================================+=========================================+ + | hive.auto.convert.join | Whether Hive converts common **join** to **mapjoin** based on the input file size. | Possible values are as follows: | + | | | | + | | .. note:: | - true | + | | | - false | + | | When Hive is used to query a join table, whatever the table size is (if the data in the join table is less than 24 MB, it is a small one), set this parameter to **false**. If this parameter is set to **true**, new **mapjoin** cannot be generated when you query a join table. | | + | | | The default value is **true**. | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.default.fileformat | Indicates the default file format used by Hive. | Versions earlier than MRS 3.x: TextFile | + | | | | + | | | MRS 3.x or later: RCFile | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.exec.reducers.max | Indicates the maximum number of reducers in a MapReduce job submitted by Hive. | 999 | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.server2.thrift.max.worker.threads | Indicates the maximum number of threads that can be started in the HiveServer internal thread pool. 
| 1,000 | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.server2.thrift.min.worker.threads | Indicates the number of threads started during initialization in the HiveServer internal thread pool. | 5 | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.hbase.delete.mode.enabled | Indicates whether to enable the function of deleting HBase records from Hive. If this function is enabled, you can use **remove table xx where xxx** to delete HBase records from Hive. | true | + | | | | + | | .. note:: | | + | | | | + | | This parameter applies to MRS 3.\ *x* or later. | | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.metastore.server.min.threads | Indicates the number of threads started by MetaStore for processing connections. If the number of threads is more than the set value, MetaStore always maintains a number of threads that is not lower than the set value, that is, the number of resident threads in the MetaStore thread pool is always higher than the set value. | 200 | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ + | hive.server2.enable.doAs | Indicates whether to simulate client users during sessions between HiveServer2 and other services (such as Yarn and HDFS). If you change the configuration item from **false** to **true**, users with only the column permission lose the permissions to access corresponding tables. | true | + | | | | + | | .. note:: | | + | | | | + | | This parameter applies to MRS 3.\ *x* or later. 
| | + +----------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+ diff --git a/doc/component-operation-guide/source/using_hive/configuring_https_http-based_rest_apis.rst b/doc/component-operation-guide/source/using_hive/configuring_https_http-based_rest_apis.rst new file mode 100644 index 0000000..53a6796 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/configuring_https_http-based_rest_apis.rst @@ -0,0 +1,36 @@ +:original_name: mrs_01_0957.html + +.. _mrs_01_0957: + +Configuring HTTPS/HTTP-based REST APIs +====================================== + +Scenario +-------- + +WebHCat provides external REST APIs for Hive. By default, the open-source community version uses the HTTP protocol. + +MRS Hive supports the HTTPS protocol that is more secure, and enables switchover between the HTTP protocol and the HTTPS protocol. + +.. note:: + + The security mode supports HTTPS and HTTP, and the common mode supports only HTTP. + +Procedure +--------- + +#. The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Modify the Hive configuration. + + - For versions earlier than MRS 3.x: Enter the parameter name in the search box, search for **templeton.protocol.type**, change the parameter value to **HTTPS** or **HTTP**, and restart the Hive service to use the corresponding protocol. + - For MRS 3.\ *x* or earlier: Choose **WebHCat** > **Security**. On the page that is displayed, select **HTTPS** or **HTTP**. After the modification, restart the Hive service to use the corresponding protocol. diff --git a/doc/component-operation-guide/source/using_hive/creating_databases_and_creating_tables_in_the_default_database_only_as_the_hive_administrator.rst b/doc/component-operation-guide/source/using_hive/creating_databases_and_creating_tables_in_the_default_database_only_as_the_hive_administrator.rst new file mode 100644 index 0000000..d0949b3 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/creating_databases_and_creating_tables_in_the_default_database_only_as_the_hive_administrator.rst @@ -0,0 +1,44 @@ +:original_name: mrs_01_0969.html + +.. 
_mrs_01_0969: + +Creating Databases and Creating Tables in the Default Database Only as the Hive Administrator +============================================================================================= + +Scenario +-------- + +This function is applicable to Hive and Spark2x for MRS 3.\ *x* or later, or Hive and Spark for versions earlier than MRS 3.x. + +After this function is enabled, only the Hive administrator can create databases and tables in the default database. Other users can use the databases only after being authorized by the Hive administrator. + +.. note:: + + - After this function is enabled, common users are not allowed to create a database or create a table in the default database. Based on the actual application scenario, determine whether to enable this function. + - Permissions of common users are restricted. In the scenario where common users have been used to perform operations, such as database creation, table script migration, and metadata recreation in an earlier version of database, the users can perform such operations on the database in the condition that this function is disabled temporarily after the database is migrated or after the cluster is upgraded. + +Procedure +--------- + +#. The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Choose **HiveServer(Role)** > **Customization**, add a customized parameter to the **hive-site.xml** parameter file, set **Name** to **hive.allow.only.admin.create**, and set **Value** to **true**. Restart all Hive instances after the modification. +#. Determine whether to enable this function on the Spark/Spark2x client. + + - If yes, go to :ref:`4 `. + - If no, no further action is required. + +4. .. _mrs_01_0969__li475373212497: + + Choose **SparkResource2x** > **Customization**, add a customized parameter to the **hive-site.xml** parameter file, set **Name** to **hive.allow.only.admin.create**, and set **Value** to **true**. Then, choose **JDBCServer2x** > **Customization** and repeat the preceding operations to add the customized parameter. Restart all Spark2x instances after the modification. + +5. Download and install the Spark/Spark2x client again. diff --git a/doc/component-operation-guide/source/using_hive/customizing_row_separators.rst b/doc/component-operation-guide/source/using_hive/customizing_row_separators.rst new file mode 100644 index 0000000..f1f250e --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/customizing_row_separators.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_0955.html + +.. 
_mrs_01_0955: + +Customizing Row Separators +========================== + +Scenario +-------- + +In most cases, a carriage return character is used as the row delimiter in Hive tables stored in text files, that is, the carriage return character is used as the terminator of a row during queries. However, some data files are delimited by special characters rather than a carriage return character. + +MRS Hive allows you to use different characters or character combinations to delimit rows of Hive text data. When creating a table, set **inputformat** to **SpecifiedDelimiterInputFormat**, and set the following parameter before each query. The table data is then queried by the specified delimiter. + +**set hive.textinput.record.delimiter='<delimiter>';** + +.. note:: + + - The Hue component of the current version does not support the configuration of multiple separators when files are imported to a Hive table. + - This section applies to MRS 3.\ *x* or later. + +Procedure +--------- + +#. Specify **inputFormat** and **outputFormat** when creating a table. + + **CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]** *[db_name.]table_name* **[(**\ *col_name data_type* **[COMMENT** *col_comment*\ **],** *...*\ **)] [ROW FORMAT** *row_format*\ **] STORED AS inputformat 'org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'** + +#. Specify the delimiter before each query. + + **set hive.textinput.record.delimiter='!@!';** + + Hive will use '!@!' as the row delimiter. diff --git a/doc/component-operation-guide/source/using_hive/deleting_single-row_records_from_hive_on_hbase.rst b/doc/component-operation-guide/source/using_hive/deleting_single-row_records_from_hive_on_hbase.rst new file mode 100644 index 0000000..46170d0 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/deleting_single-row_records_from_hive_on_hbase.rst @@ -0,0 +1,29 @@ +:original_name: mrs_01_0956.html + +.. _mrs_01_0956: + +Deleting Single-Row Records from Hive on HBase +============================================== + +Scenario +-------- + +Due to the limitations of the underlying storage system, Hive does not support deleting a single row of table data. However, in Hive on HBase scenarios, MRS Hive supports deleting a single row of HBase table data. Using a specific syntax, Hive can delete one or more rows of data from an HBase table. + +.. table:: **Table 1** Permissions required for deleting single-row records from the Hive on HBase table + + =========================== ========================== + Cluster Authentication Mode Required Permission + =========================== ========================== + Security mode SELECT, INSERT, and DELETE + Common mode None + =========================== ========================== + +Procedure +--------- + +#. To delete some data from an HBase table, run the following HQL statement: + + **remove table <table_name> where <expression>;** + + In the preceding statement, **<expression>** specifies the filter condition of the data to be deleted, and **<table_name>** indicates the Hive on HBase table from which data is to be deleted.
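+
+   The following is a minimal sketch of this syntax (the table name **hbase_table** and the column **id** are illustrative assumptions only; replace them with your own Hive on HBase table and filter condition):
+
+   .. code-block::
+
+      -- Delete the single row whose id is 1 from the Hive on HBase table.
+      remove table hbase_table where id = 1;
+
+      -- Delete all rows that match a broader filter condition.
+      remove table hbase_table where id >= 100 and id < 200;
+
+   In security mode, the user running these statements must have the SELECT, INSERT, and DELETE permissions listed in Table 1.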
diff --git a/doc/component-operation-guide/source/using_hive/disabling_of_specifying_the_location_keyword_when_creating_an_internal_hive_table.rst b/doc/component-operation-guide/source/using_hive/disabling_of_specifying_the_location_keyword_when_creating_an_internal_hive_table.rst new file mode 100644 index 0000000..ac99f87 --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/disabling_of_specifying_the_location_keyword_when_creating_an_internal_hive_table.rst @@ -0,0 +1,37 @@ +:original_name: mrs_01_0970.html + +.. _mrs_01_0970: + +Disabling of Specifying the location Keyword When Creating an Internal Hive Table +================================================================================= + +Scenario +-------- + +This function is applicable to Hive and Spark2x for MRS 3.\ *x* or later, or Hive and Spark for versions earlier than MRS 3.x. + +After this function is enabled, the **location** keyword cannot be specified when a Hive internal table is created. Specifically, after a table is created, the table path following the location keyword is created in the default **\\warehouse** directory and cannot be specified to another directory. If the location is specified when the internal table is created, the creation fails. + +.. note:: + + After this function is enabled, the location keyword cannot be specified during the creation of a Hive internal table. The table creation statement is restricted. If a table that has been created in the database is not stored in the default directory **/warehouse**, the **location** keyword can still be specified when the database creation, table script migration, or metadata recreation operation is performed by disabling this function temporarily. + +Procedure +--------- + +#. The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Choose **HiveServer(Role)** > **Customization**, add a customized parameter to the **hive-site.xml** parameter file, set **Name** to **hive.internaltable.notallowlocation**, and set **Value** to **true**. Restart all Hive instances after the modification. +#. Determine whether to enable this function on the Spark/Spark2x client. + + - If yes, download and install the Spark/Spark2x client again. + - If no, no further action is required. diff --git a/doc/component-operation-guide/source/using_hive/enabling_or_disabling_the_transform_function.rst b/doc/component-operation-guide/source/using_hive/enabling_or_disabling_the_transform_function.rst new file mode 100644 index 0000000..bbc9c2c --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/enabling_or_disabling_the_transform_function.rst @@ -0,0 +1,40 @@ +:original_name: mrs_01_0958.html + +.. 
_mrs_01_0958: + +Enabling or Disabling the Transform Function +============================================ + +Scenario +-------- + +The Transform function is not allowed by Hive of the open source version. + +MRS Hive supports the configuration of the Transform function. The function is disabled by default, which is the same as that of the open-source community version. + +Users can modify configurations of the Transform function to enable the function. However, security risks exist when the Transform function is enabled. + +.. note:: + + The Transform function can be disabled only in security mode. + +Procedure +--------- + +#. The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Enter the parameter name in the search box, search for **hive.security.transform.disallow**, change the parameter value to **true** or **false**, and restart all HiveServer instances. + + .. note:: + + - If this parameter is set to **true**, the Transform function is disabled, which is the same as that in the open-source community version. + - If this parameter is set to **false**, the Transform function is enabled, which poses security risks. diff --git a/doc/component-operation-guide/source/using_hive/enabling_the_function_of_creating_a_foreign_table_in_a_directory_that_can_only_be_read.rst b/doc/component-operation-guide/source/using_hive/enabling_the_function_of_creating_a_foreign_table_in_a_directory_that_can_only_be_read.rst new file mode 100644 index 0000000..718aecb --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/enabling_the_function_of_creating_a_foreign_table_in_a_directory_that_can_only_be_read.rst @@ -0,0 +1,38 @@ +:original_name: mrs_01_0971.html + +.. _mrs_01_0971: + +Enabling the Function of Creating a Foreign Table in a Directory That Can Only Be Read +====================================================================================== + +Scenario +-------- + +This function is applicable to Hive and Spark2x for MRS 3.\ *x* or later, or Hive and Spark for versions earlier than MRS 3.x. + +After this function is enabled, the user or user group that has the read and execute permissions on a directory can create foreign tables in the directory without checking whether the current user is the owner of the directory. In addition, the directory of a foreign table cannot be stored in the default directory **\\warehouse**. In addition, do not change the permission of the directory during foreign table authorization. + +.. note:: + + After this function is enabled, the function of the foreign table changes greatly. Based on the actual application scenario, determine whether to enable this function. + +Procedure +--------- + +#. 
The Hive service configuration page is displayed. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager, choose **Services** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + - For MRS 1.9.2 or later, click the cluster name on the MRS console, choose **Components** > **Hive** > **Service Configuration**, and select **All** from the **Basic** drop-down list. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. And choose **Cluster** > *Name of the desired cluster* > **Services** > **Hive** > **Configurations** > **All Configurations**. + +#. Choose **HiveServer(Role)** > **Customization**, add a customized parameter to the **hive-site.xml** parameter file, set **Name** to **hive.restrict.create.grant.external.table**, and set **Value** to **true**. +#. Choose **MetaStore(Role)** > **Customization**, add a customized parameter to the **hivemetastore-site.xml** parameter file, set **Name** to **hive.restrict.create.grant.external.table**, and set **Value** to **true**. Restart all Hive instances after the modification. +#. Determine whether to enable this function on the Spark/Spark2x client. + + - If yes, download and install the Spark/Spark2x client again. + - If no, no further action is required. diff --git a/doc/component-operation-guide/source/using_hive/hive_log_overview.rst b/doc/component-operation-guide/source/using_hive/hive_log_overview.rst new file mode 100644 index 0000000..3991a5a --- /dev/null +++ b/doc/component-operation-guide/source/using_hive/hive_log_overview.rst @@ -0,0 +1,129 @@ +:original_name: mrs_01_0976.html + +.. _mrs_01_0976: + +Hive Log Overview +================= + +Log Description +--------------- + +**Log path**: The default save path of Hive logs is **/var/log/Bigdata/hive/**\ *role name*, the default save path of Hive1 logs is **/var/log/Bigdata/hive1/**\ *role name*, and the others follow the same rule. + +- HiveServer: **/var/log/Bigdata/hive/hiveserver** (run log) and **var/log/Bigdata/audit/hive/hiveserver** (audit log) +- MetaStore: **/var/log/Bigdata/hive/metastore** (run log) and **/var/log/Bigdata/audit/hive/metastore** (audit log) +- WebHCat: **/var/log/Bigdata/hive/webhcat** (run log) and **/var/log/Bigdata/audit/hive/webhcat** (audit log) + +**Log archive rule**: The automatic compression and archiving function of Hive is enabled. By default, when the size of a log file exceeds 20 MB (which is adjustable), the log file is automatically compressed. The naming rule of a compressed log file is as follows: <*Original log name*>-<*yyyy-mm-dd_hh-mm-ss*>.[*ID*].\ **log.zip** A maximum of 20 latest compressed files are reserved. The number of compressed files and compression threshold can be configured. + +.. 
table:: **Table 1** Hive log list + + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | Log Type | Log File Name | Description | + +=======================+============================================================================+=====================================================================================+ + | Run log | /hiveserver/hiveserver.out | Log file that records HiveServer running environment information. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/hive.log | Run log file of the HiveServer process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/hive-omm-**\ ``-``\ **-gc.log.\ ** | GC log file of the HiveServer process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/prestartDetail.log | Work log file before the HiveServer startup. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/check-serviceDetail.log | Log file that records whether the Hive service starts successfully | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/cleanupDetail.log | Cleanup log file about the HiveServer uninstallation | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/startDetail.log | Startup log file of the HiveServer process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/stopDetail.log | Shutdown log file of the HiveServer process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/localtasklog/omm\_\ **\ \_\ **.log | Run log file of the local Hive task. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /hiveserver/localtasklog/omm\_\ **\ \_\ **-gc.log.\ ** | GC log file of the local Hive task. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/metastore.log | Run log file of the MetaStore process. 
| + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/hive-omm-**\ ``-``\ **-gc.log.\ ** | GC log file of the MetaStore process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/postinstallDetail.log | Work log file after the MetaStore installation. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/prestartDetail.log | Work log file before the MetaStore startup | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/cleanupDetail.log | Cleanup log file of the MetaStore uninstallation | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/startDetail.log | Startup log file of the MetaStore process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/stopDetail.log | Shutdown log file of the MetaStore process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /metastore/metastore.out | Log file that records MetaStore running environment information. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/webhcat-console.out | Log file that records the normal start and stop of the WebHCat process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/webhcat-console-error.out | Log file that records the start and stop exceptions of the WebHCat process. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/prestartDetail.log | Work log file before the WebHCat startup. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/cleanupDetail.log | Cleanup logs generated during WebHCat uninstallation or before WebHCat installation | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/hive-omm-<*Date*>--gc.log.<*No*.> | GC log file of the WebHCat process. 
| + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | /webhcat/webhcat.log | Run log file of the WebHCat process | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | Audit log | hive-audit.log | HiveServer audit log file | + | | | | + | | hive-rangeraudit.log | | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | metastore-audit.log | MetaStore audit log file. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | webhcat-audit.log | WebHCat audit log file. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + | | jetty-.request.log | Request logs of the jetty service. | + +-----------------------+----------------------------------------------------------------------------+-------------------------------------------------------------------------------------+ + +Log Levels +---------- + +:ref:`Table 2 ` describes the log levels supported by Hive. + +Levels of run logs are ERROR, WARN, INFO, and DEBUG from the highest to the lowest priority. Run logs of equal or higher levels are recorded. The higher the specified log level, the fewer the logs recorded. + +.. _mrs_01_0976__t91045e1a946a46b4bac39028af62f3ad: + +.. table:: **Table 2** Log levels + + +-------+------------------------------------------------------------------------------------------+ + | Level | Description | + +=======+==========================================================================================+ + | ERROR | Logs of this level record error information about system running. | + +-------+------------------------------------------------------------------------------------------+ + | WARN | Logs of this level record exception information about the current event processing. | + +-------+------------------------------------------------------------------------------------------+ + | INFO | Logs of this level record normal running status information about the system and events. | + +-------+------------------------------------------------------------------------------------------+ + | DEBUG | Logs of this level record the system information and system debugging information. | + +-------+------------------------------------------------------------------------------------------+ + +To modify log levels, perform the following operations: + +#. Go to the **All Configurations** page of the Yarn service by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the menu bar on the left, select the log menu of the target role. +#. Select a desired log level and save the configuration. + + .. note:: + + The Hive log level takes effect immediately after being configured. You do not need to restart the service. + +Log Formats +----------- + +The following table lists the Hive log formats: + +.. 
table:: **Table 3** Log formats + + +-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Format | Example | + +===========+=====================================================================================================================================================================+=====================================================================================================================================================================================================================================================================================+ + | Run log | |||| | 2014-11-05 09:45:01,242 \| INFO \| main \| Starting hive metastore on port 21088 \| org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5198) | + +-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | |||
**. | + | | | + | | The fields to be queried cannot contain fields whose data type is string. Otherwise, the error message "java.sql.SQLException: Invalid value for getLong()" is displayed. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -columns | Specifies the fields to be imported. The format is **-Column id,\ Username**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -direct | Imports data to a relational database using a database import tool, for example, mysqlimport of MySQL, more efficient than the JDBC connection mode. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -direct-split-size | Splits the imported streams by byte. Especially when data is imported from PostgreSQL using the direct mode, a file that reaches the specified size can be divided into several independent files. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -inline-lob-limit | Sets the maximum value of an inline LOB. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -m or -num-mappers | Starts *n* (4 by default) maps to import data concurrently. The value cannot be greater than the maximum number of maps in a cluster. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -query, -e | Imports data from the query result. 
To use this parameter, you must specify the **-target-dir** and **-hive-table** parameters and use the query statement containing the WHERE clause as well as $CONDITIONS. | + | | | + | | Example: **-query'select \* from person where $CONDITIONS' -target-dir /user/hive/warehouse/person -hive-table person** | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -split-by | Specifies the column of a table used to split work units. Generally, the column name is followed by the primary key ID. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -table | Specifies the relational database table from which data is obtained. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -target-dir | Specifies the HDFS path. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -warehouse-dir | Specifies the directory for storing data to be imported. This parameter is applicable when data is imported to HDFS but cannot be used when you import data to Hive directories. This parameter cannot be used together with **-target-dir**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -where | Specifies the WHERE clause when data is imported from a relational database, for example, **-where 'id = 2'**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -z,-compress | Compresses sequence, text, and Avro data files using the GZIP compression algorithm. Data is not compressed by default. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -compression-codec | Specifies the Hadoop compression codec. GZIP is used by default. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -null-string | Specifies the string to be interpreted as **NULL** for string columns. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -null-non-string | Specifies the string to be interpreted as null for non-string columns. If this parameter is not specified, **NULL** will be used. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -check-column (col) | Specifies the column for checking incremental data import, for example, **id**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -incremental (mode) append | Incrementally imports data. | + | | | + | or last modified | **append**: appends records, for example, appending records that are greater than the value specified by **last-value**. | + | | | + | | **lastmodified**: appends data that is modified after the date specified by **last-value**. | + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | -last-value (value) | Specifies the maximum value (greater than the specified value) of the column after the last import. This parameter can be set as required. 
| + +-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Sqoop Usage Example +------------------- + +- Importing data from MySQL to HDFS using the **sqoop import** command + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--query 'SELECT \* FROM component where $CONDITIONS and component_id ="MRS 1.0_002"' --target-dir /tmp/component_test --delete-target-dir --fields-terminated-by "," -m 1 --as-textfile** + +- Exporting data from OBS to MySQL using the **sqoop export** command + + **sqoop export --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component14 -export-dir obs://obs-file-bucket/xx/part-m-00000 --fields-terminated-by ',' -m 1** + +- Importing data from MySQL to OBS using the **sqoop import** command + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component --target-dir obs://obs-file-bucket/xx --delete-target-dir --fields-terminated-by "," -m 1 --as-textfile** + +- Importing data from MySQL to OBS tables outside Hive + + **sqoop import --connect jdbc:mysql://10.100.231.134:3306/test --username root --password** *xxx* **--table component --hive-import --hive-table component_test01 --fields-terminated-by "," -m 1 --as-textfile** diff --git a/doc/component-operation-guide/source/using_storm/accessing_the_storm_web_ui.rst b/doc/component-operation-guide/source/using_storm/accessing_the_storm_web_ui.rst new file mode 100644 index 0000000..c3e85e5 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/accessing_the_storm_web_ui.rst @@ -0,0 +1,56 @@ +:original_name: mrs_01_0382.html + +.. _mrs_01_0382: + +Accessing the Storm Web UI +========================== + +Scenario +-------- + +The Storm web UI provides a graphical interface for using Storm. + +The following information can be queried on the Storm web UI: + +- Storm cluster summary +- Nimbus summary +- Topology summary +- Supervisor summary +- Nimbus configurations + +Prerequisites +------------- + +- The password of user **admin** has been obtained. The password of user **admin** is specified by you during the cluster creation. +- If a user other than **admin** is used to access the Storm web UI, the user must be added to the **storm** or **stormadmin** user group. + +Procedure +--------- + +#. Access the component management page. + + - For versions earlier than MRS 1.9.2, log in to MRS Manager and choose **Services**. + - For versions earlier than MRS 3.\ *x*, click the cluster name to go to the cluster details page and choose **Components**. + + .. note:: + + If the **Components** tab is unavailable, complete IAM user synchronization first. (On the **Dashboard** page, click **Synchronize** on the right side of **IAM User Sync** to synchronize IAM users.) + + - For MRS 3.\ *x* or later, log in to FusionInsight Manager. For details, see :ref:`Accessing FusionInsight Manager (MRS 3.x or Later) `. Choose **Cluster** > *Name of the desired cluster* > **Services**. + +#. Log in to the Storm WebUI. + + - For versions earlier than MRS 3.x: Choose **Storm**. 
On the **Storm Summary** area, click any UI link on the right side of **Storm Web UI** to open the Storm web UI. + + .. note:: + + When accessing the Storm web UI for the first time, you must add the address to the trusted site list. + + - For MRS 3.x or later, choose **Storm** > **Overview**. In the **Basic Information** area, click any UI link on the right side of **Storm Web UI** to open the Storm web UI. + +Related Tasks +------------- + +- Click a topology name to view details, status, Spouts information, Bolts information, and configuration information of the topology. +- In the **Topology actions** area, click **Activate**, **Deactivate**, **Rebalance**, **Kill**, **Debug**, **Stop Debug**, and **Change Log Level** to activate, deactivate, redeploy, delete, debug, and stop debugging the topology, and modify the log levels, respectively. You need to set the waiting time for the redeployment and deletion operations. The unit is second. +- In the **Topology Visualization** area, click **Show Visualization** to visualize a topology. After the topology is visualized, the WebUI displays the topology structure. diff --git a/doc/component-operation-guide/source/using_storm/configuring_a_storm_service_user_password_policy.rst b/doc/component-operation-guide/source/using_storm/configuring_a_storm_service_user_password_policy.rst new file mode 100644 index 0000000..4a39ba6 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/configuring_a_storm_service_user_password_policy.rst @@ -0,0 +1,121 @@ +:original_name: mrs_01_1047.html + +.. _mrs_01_1047: + +Configuring a Storm Service User Password Policy +================================================ + +Scenario +-------- + +This section applies to MRS 3.\ *x* or later. + +After submitting a topology task, a Storm service user must ensure that the task continuously runs. During topology running, the worker process may need to restart to ensure continuous topology work. If the password of a service user is changed or the number of days that a password is used exceeds the maximum number specified in a password policy, topology running may be affected. A system administrator must configure a separate password policy for Storm service users based on enterprise security requirements. + +.. note:: + + If a separate password policy is not configured for Storm service users, an old topology can be deleted and then submitted again after a service user password is changed so that the topology can continuous run. + +Impact on the System +-------------------- + +- After a separate password policy is configured for a Storm service user, the user is not affected by **Password Policy** on the Manager page. +- If a separate password policy is configured for a Storm service user and cross-cluster entrusted relationships are configured, a password must be reset for the Storm service user on Manager based on the password policy. + +Prerequisites +------------- + +A system administrator has understood service requirements and created a **Human-Machine** user, for example, **testpol**. + +Procedure +--------- + +#. Log in to any node in the cluster as user **omm**. + +#. Run the following command to disable logout upon timeout: + + **TMOUT=0** + + .. note:: + + After the operations in this section are complete, run the **TMOUT=**\ *Timeout interval* command to restore the timeout interval in a timely manner. For example, **TMOUT=600** indicates that a user is logged out if the user does not perform any operation within 600 seconds. + +#. 
Run the following commands to export the environment variables: + + **EXECUTABLE_HOME="${CONTROLLER_HOME}/kerberos_user_specific_binay/kerberos"** + + **LD_LIBRARY_PATH=${EXECUTABLE_HOME}/lib:$LD_LIBRARY_PATH** + + **PATH=${EXECUTABLE_HOME}/bin:$PATH** + +#. Run the following command and enter the Kerberos administrator password to log in to the Kerberos console: + + **kadmin -p kadmin/admin** + + .. note:: + + For initial use, the **kadmin/admin** password must be changed for the **kadmin/admin** user. + + If the following information is displayed, you have successfully logged in to the Kerberos console. + + .. code-block:: + + kadmin: + +#. Run the following command to check details about the created **Human-Machine** user: + + **getprinc**\ *Username* + + Sample command for viewing details about the **testpol** user: + + **getprinc testpol** + + If the following information is displayed, the specified user has used the default password policy: + + .. code-block:: + + Principal: testpol@ + ...... + Policy: default + +#. Run the following command to create a separate password policy, such as **streampol**, for the Storm service user: + + **addpol -maxlife 0day -minlife 0sec -history 1 -maxfailure 5 -failurecountinterval 5min -lockoutduration 5min -minlength 8 -minclasses 4 streampol** + + In the command, **-maxlife** indicates the maximum validity period of a password, and **0day** indicates that a password will never expire. + +#. Run the following command to view the newly created policy **streampol**: + + **getpol streampol** + + If the following information is displayed, the new policy specifies that the password will never expire: + + .. code-block:: + + Policy: streampol + Maximum password life: 0 days 00:00:00 + ...... + +#. Run the following command to apply the new policy **streampol** to the **testpol** Storm user: + + **modprinc -policy streampol testpol** + + In the command, **streampol** indicates a policy name, and **testpol** indicates a username. + + If the following information is displayed, the properties of the specified user have been modified: + + .. code-block:: + + Principal "testpol@" modified. + +#. Run the following command to view current information about the **testpol** Storm user: + + **getprinc testpol** + + If the following information is displayed, the specified user has used the new password policy: + + .. code-block:: + + Principal: testpol@ + ...... + Policy: streampol diff --git a/doc/component-operation-guide/source/using_storm/index.rst b/doc/component-operation-guide/source/using_storm/index.rst new file mode 100644 index 0000000..67bad20 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/index.rst @@ -0,0 +1,34 @@ +:original_name: mrs_01_0380.html + +.. _mrs_01_0380: + +Using Storm +=========== + +- :ref:`Using Storm from Scratch ` +- :ref:`Using the Storm Client ` +- :ref:`Submitting Storm Topologies on the Client ` +- :ref:`Accessing the Storm Web UI ` +- :ref:`Managing Storm Topologies ` +- :ref:`Querying Storm Topology Logs ` +- :ref:`Storm Common Parameters ` +- :ref:`Configuring a Storm Service User Password Policy ` +- :ref:`Migrating Storm Services to Flink ` +- :ref:`Storm Log Introduction ` +- :ref:`Performance Tuning ` + +.. 
toctree:: + :maxdepth: 1 + :hidden: + + using_storm_from_scratch + using_the_storm_client + submitting_storm_topologies_on_the_client + accessing_the_storm_web_ui + managing_storm_topologies + querying_storm_topology_logs + storm_common_parameters + configuring_a_storm_service_user_password_policy + migrating_storm_services_to_flink/index + storm_log_introduction + performance_tuning/index diff --git a/doc/component-operation-guide/source/using_storm/managing_storm_topologies.rst b/doc/component-operation-guide/source/using_storm/managing_storm_topologies.rst new file mode 100644 index 0000000..a6e9207 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/managing_storm_topologies.rst @@ -0,0 +1,54 @@ +:original_name: mrs_01_0383.html + +.. _mrs_01_0383: + +Managing Storm Topologies +========================= + +Scenario +-------- + +You can manage Storm topologies on the Storm web UI. Users in the **storm** group can manage only the topology tasks submitted by themselves, while users in the **stormadmin** group can manage all topology tasks. + +Procedure +--------- + +#. For details about how to access the Storm WebUI, see :ref:`Accessing the Storm Web UI `. + +#. In the **Topology summary** area, click the desired topology. + +#. Use options in **Topology actions** to manage the Storm topology. + + - Activating a topology + + Click **Activate** to activate the topology. + + - Deactivating a topology + + Click **Deactivate** to deactivate the topology. + + - Re-deploying a topology + + Click **Rebalance** and specify the wait time (in seconds) of re-deployment. Generally, if the number of nodes in a cluster changes, the topology can be re-deployed to maximize resource usage. + + - Deleting a topology + + Click **Kill** and specify the wait time (in seconds) of the deletion. + + - Starting or stopping sampling messages + + Click **Debug**. In the dialog box displayed, specify the percentage of the sampled data volume. For example, if the value is set to **10**, 10% of data is sampled. + + To stop sampling, click **Stop Debug**. + + .. note:: + + This function is available only if the sampling function is enabled when the topology is submitted. For details about querying data processing information, see :ref:`Querying Storm Topology Logs `. + + - Modifying the topology log level + + Click **Change Log Level** to specify a new log level. + +#. Displaying a topology + + In the **Topology Visualization** area, click **Show Visualization** to visualize the topology. diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/completely_migrating_storm_services.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/completely_migrating_storm_services.rst new file mode 100644 index 0000000..bf7ea87 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/completely_migrating_storm_services.rst @@ -0,0 +1,103 @@ +:original_name: mrs_01_1050.html + +.. _mrs_01_1050: + +Completely Migrating Storm Services +=================================== + +Scenarios +--------- + +This section describes how to convert and run a complete Storm topology developed using Storm API. + +Procedure +--------- + +#. Open the Storm service project, modify the POM file of the project, and add the reference of **flink-storm_2.11**, **flink-core**, and **flink-streaming-java_2.11**. The following figure shows an example. + + .. 
code-block:: + + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-storm_2.11</artifactId> + <version>1.4.0</version> + <exclusions> + <exclusion> + <groupId>*</groupId> + <artifactId>*</artifactId> + </exclusion> + </exclusions> + </dependency> + + .. code-block:: + + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-core</artifactId> + <version>1.4.0</version> + <exclusions> + <exclusion> + <groupId>*</groupId> + <artifactId>*</artifactId> + </exclusion> + </exclusions> + </dependency> + + .. code-block:: + + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java_2.11</artifactId> + <version>1.4.0</version> + <exclusions> + <exclusion> + <groupId>*</groupId> + <artifactId>*</artifactId> + </exclusion> + </exclusions> + </dependency> + + .. note:: + + If the project is a non-Maven project, manually collect the preceding JAR packages and add them to the *classpath* environment variable of the project. + +2. Modify the code for submitting the topology. The following uses WordCount as an example: + + a. Keep the structure of the Storm topology unchanged, including the Spout and Bolt developed using Storm API. + + .. code-block:: + + TopologyBuilder builder = new TopologyBuilder(); + builder.setSpout("spout", new RandomSentenceSpout(), 5); + builder.setBolt("split", new SplitSentenceBolt(), 8).shuffleGrouping("spout"); + builder.setBolt("count", new WordCountBolt(), 12).fieldsGrouping("split", new Fields("word")); + + b. Modify the code for submitting the topology. Assume that the original Storm submission code is as follows: + + .. code-block:: + + Config conf = new Config(); + conf.setNumWorkers(3); + StormSubmitter.submitTopology("word-count", conf, builder.createTopology()); + + Change it to the following: + + .. code-block:: + + Config conf = new Config(); + conf.setNumWorkers(3); + //Convert Storm Config to StormConfig of Flink. + StormConfig stormConfig = new StormConfig(conf); + //Construct FlinkTopology using TopologyBuilder of Storm. + FlinkTopology topology = FlinkTopology.createTopology(builder); + //Obtain the Stream execution environment. + StreamExecutionEnvironment env = topology.getExecutionEnvironment(); + //Set StormConfig to the environment variable of Job to construct Bolt and Spout. + //If StormConfig is not required during the initialization of Bolt and Spout, you do not need to set this parameter. + env.getConfig().setGlobalJobParameters(stormConfig); + //Submit the topology. + topology.execute(); + + c. After the project is repackaged, run the following command to submit the package: + + **flink run --class {MainClass} WordCount.jar** diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst new file mode 100644 index 0000000..f746a21 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst @@ -0,0 +1,20 @@ +:original_name: mrs_01_1048.html + +.. _mrs_01_1048: + +Migrating Storm Services to Flink +================================= + +- :ref:`Overview <mrs_01_1049>` +- :ref:`Completely Migrating Storm Services <mrs_01_1050>` +- :ref:`Performing Embedded Service Migration <mrs_01_1051>` +- :ref:`Migrating Services of External Security Components Interconnected with Storm <mrs_01_1052>` + +.. 
diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst new file mode 100644 index 0000000..f746a21 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/index.rst @@ -0,0 +1,20 @@
+:original_name: mrs_01_1048.html
+
+.. _mrs_01_1048:
+
+Migrating Storm Services to Flink
+=================================
+
+- :ref:`Overview <mrs_01_1049>`
+- :ref:`Completely Migrating Storm Services <mrs_01_1050>`
+- :ref:`Performing Embedded Service Migration <mrs_01_1051>`
+- :ref:`Migrating Services of External Security Components Interconnected with Storm <mrs_01_1052>`
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   overview
+   completely_migrating_storm_services
+   performing_embedded_service_migration
+   migrating_services_of_external_security_components_interconnected_with_storm
diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/migrating_services_of_external_security_components_interconnected_with_storm.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/migrating_services_of_external_security_components_interconnected_with_storm.rst new file mode 100644 index 0000000..e43f67f --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/migrating_services_of_external_security_components_interconnected_with_storm.rst @@ -0,0 +1,65 @@
+:original_name: mrs_01_1052.html
+
+.. _mrs_01_1052:
+
+Migrating Services of External Security Components Interconnected with Storm
+=============================================================================
+
+Migrating Services for Interconnecting Storm with HDFS and HBase
+-----------------------------------------------------------------
+
+If the Storm services use the **storm-hdfs** or **storm-hbase** plug-in package for interconnection, you need to specify the following security parameters when migrating Storm services as instructed in :ref:`Completely Migrating Storm Services <mrs_01_1050>`.
+
+.. code-block::
+
+   //Initialize Storm Config.
+   Config conf = new Config();
+
+   //Initialize the security plug-in list.
+   List<String> auto_tgts = new ArrayList<String>();
+   //Add the AutoTGT plug-in.
+   auto_tgts.add("org.apache.storm.security.auth.kerberos.AutoTGT");
+   //Add the AutoHDFS plug-in.
+   //If HBase is interconnected, use auto_tgts.add("org.apache.storm.hbase.security.AutoHBase") to replace the following line:
+   auto_tgts.add("org.apache.storm.hdfs.common.security.AutoHDFS");
+
+   //Set security parameters.
+   conf.put(Config.TOPOLOGY_AUTO_CREDENTIALS, auto_tgts);
+   //Set the number of workers.
+   conf.setNumWorkers(3);
+
+   //Convert Storm Config to StormConfig of Flink.
+   StormConfig stormConfig = new StormConfig(conf);
+
+   //Construct FlinkTopology using TopologyBuilder of Storm.
+   FlinkTopology topology = FlinkTopology.createTopology(builder);
+
+   //Obtain the StreamExecutionEnvironment.
+   StreamExecutionEnvironment env = topology.getExecutionEnvironment();
+
+   //Add StormConfig to the environment variable of the job to construct Bolt and Spout.
+   //If Config is not required during the initialization of Bolt and Spout, do not set this parameter.
+   env.getConfig().setGlobalJobParameters(stormConfig);
+
+   //Submit the topology.
+   topology.execute();
+
+After the preceding security plug-ins are configured, unnecessary logins during the initialization of HDFSBolt and HBaseBolt are avoided because the security context has been configured in Flink.
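+
+For a service that interconnects with HBase only, the plug-in list differs only in the second entry, as noted in the code comments above. The following is a minimal sketch of that variant; the rest of the submission code is the same as in the preceding example:
+
+.. code-block::
+
+   //Initialize the security plug-in list for HBase interconnection.
+   List<String> auto_tgts = new ArrayList<String>();
+   //Add the AutoTGT plug-in.
+   auto_tgts.add("org.apache.storm.security.auth.kerberos.AutoTGT");
+   //Add the AutoHBase plug-in instead of AutoHDFS.
+   auto_tgts.add("org.apache.storm.hbase.security.AutoHBase");
+
+   //Set security parameters.
+   conf.put(Config.TOPOLOGY_AUTO_CREDENTIALS, auto_tgts);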
+
+Migrating Services of Storm Interconnected with Other Security Components
+--------------------------------------------------------------------------
+
+If plug-in packages such as **storm-kafka-client** and **storm-solr** are used for interconnection between Storm and other components, the security plug-ins previously configured for these services need to be deleted during migration.
+
+.. code-block::
+
+   List<String> auto_tgts = new ArrayList<String>();
+   //keytab mode
+   auto_tgts.add("org.apache.storm.security.auth.kerberos.AutoTGTFromKeytab");
+
+   //Write the plug-in list configured on the client to the specified config parameter.
+   //Mandatory in security mode.
+   //This configuration is not required in common mode, and you can comment out the following line.
+   conf.put(Config.TOPOLOGY_AUTO_CREDENTIALS, auto_tgts);
+
+The AutoTGTFromKeytab plug-in must be deleted during service migration. Otherwise, the login will fail when Bolt or Spout is initialized.
diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/overview.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/overview.rst new file mode 100644 index 0000000..c30008e --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/overview.rst @@ -0,0 +1,17 @@
+:original_name: mrs_01_1049.html
+
+.. _mrs_01_1049:
+
+Overview
+========
+
+This section applies to MRS 3.\ *x* or later.
+
+Since version 0.10.0, Flink has provided a set of APIs for smoothly migrating services compiled with Storm APIs to the Flink platform. These APIs cover most service scenarios.
+
+Flink supports the following service migration modes:
+
+#. Complete migration of Storm services: Convert and run a complete Storm topology developed using Storm APIs.
+#. Embedded migration of Storm services: Storm code, for example, Spout or Bolt code compiled using Storm APIs, is embedded in DataStream of Flink.
+
+Flink provides the flink-storm package for the preceding service migration.
diff --git a/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/performing_embedded_service_migration.rst b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/performing_embedded_service_migration.rst new file mode 100644 index 0000000..4c51fa0 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/migrating_storm_services_to_flink/performing_embedded_service_migration.rst @@ -0,0 +1,37 @@
+:original_name: mrs_01_1051.html
+
+.. _mrs_01_1051:
+
+Performing Embedded Service Migration
+=====================================
+
+Scenarios
+---------
+
+This section describes how to embed Storm code, for example, the code of a Spout or Bolt compiled using Storm API, in DataStream of Flink in embedded migration mode.
+
+Procedure
+---------
+
+#. In Flink, perform embedded conversion on the Spout and Bolt in the Storm topology to convert them to Flink operators. The following is an example of the code:
+
+   .. code-block::
+
+      //set up the execution environment
+      final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+      //get input data
+      final DataStream<String> text = getTextDataStream(env);
+      final DataStream<Tuple2<String, Integer>> counts = text
+          //split up the lines in pairs (2-tuples) containing: (word,1)
+          //this is done by a bolt that is wrapped accordingly
+          .transform("CountBolt",
+              TypeExtractor.getForObject(new Tuple2<String, Integer>("", 0)),
+              new BoltWrapper<String, Tuple2<String, Integer>>(new CountBolt()))
+          //group by the tuple field "0" and sum up tuple field "1"
+          .keyBy(0).sum(1);
+      // execute program
+      env.execute("Streaming WordCount with bolt tokenizer");
+
+#. After the modification, run the following command to submit the modified package:
+
+   **flink run --class {MainClass} WordCount.jar**
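+
+The Spout side can be embedded in the same way by wrapping it as a Flink source. The following is a minimal sketch of how the **getTextDataStream(env)** helper used above could be implemented with the SpoutWrapper class shipped in the flink-storm package alongside BoltWrapper; **RandomSentenceSpout** and the wrapper arguments are assumptions and must be adapted to the actual Spout:
+
+.. code-block::
+
+   private static DataStream<String> getTextDataStream(StreamExecutionEnvironment env) {
+       //Wrap the Storm Spout as a Flink source.
+       //The rawOutputs argument makes the wrapper emit the Spout's single declared field as a raw String.
+       return env.addSource(
+           new SpoutWrapper<String>(new RandomSentenceSpout(), new String[] { Utils.DEFAULT_STREAM_ID }),
+           "RandomSentenceSpout",
+           TypeExtractor.getForClass(String.class));
+   }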
diff --git a/doc/component-operation-guide/source/using_storm/performance_tuning/index.rst b/doc/component-operation-guide/source/using_storm/performance_tuning/index.rst new file mode 100644 index 0000000..fc16ffc --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/performance_tuning/index.rst @@ -0,0 +1,14 @@
+:original_name: mrs_01_1054.html
+
+.. _mrs_01_1054:
+
+Performance Tuning
+==================
+
+- :ref:`Storm Performance Tuning <mrs_01_1055>`
+
+.. toctree::
+   :maxdepth: 1
+   :hidden:
+
+   storm_performance_tuning
diff --git a/doc/component-operation-guide/source/using_storm/performance_tuning/storm_performance_tuning.rst b/doc/component-operation-guide/source/using_storm/performance_tuning/storm_performance_tuning.rst new file mode 100644 index 0000000..692618a --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/performance_tuning/storm_performance_tuning.rst @@ -0,0 +1,47 @@
+:original_name: mrs_01_1055.html
+
+.. _mrs_01_1055:
+
+Storm Performance Tuning
+========================
+
+Scenario
+--------
+
+You can modify Storm parameters to improve Storm performance in specific service scenarios.
+
+This section applies to MRS 3.\ *x* or later.
+
+Modify the service configuration parameters. For details, see :ref:`Modifying Cluster Service Configuration Parameters `.
+
+Topology Tuning
+---------------
+
+This task enables you to optimize topologies so that Storm processes data more efficiently. Topology optimization is recommended in scenarios with lower reliability requirements.
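+
+In addition to the service-level configuration, the acker and spout cache parameters listed in Table 1 below can also be set for an individual topology in its submission code. The following is a minimal sketch, assuming the Storm 1.x **Config** API; the value shown is a placeholder rather than a recommendation:
+
+.. code-block::
+
+   Config conf = new Config();
+   //For a topology with low reliability requirements, disable acker (equivalent to topology.acker.executors = 0).
+   conf.setNumAckers(0);
+   //Alternatively, if acker is kept enabled, adjust the spout cache (topology.max.spout.pending);
+   //the value takes effect only when acker is enabled.
+   //conf.setMaxSpoutPending(1000);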
+
+.. table:: **Table 1** Tuning parameters
+
+   +-------------------------------+---------------+----------------------------------------------------------------------------------------------------------+
+   | Parameter                     | Default Value | Scenario                                                                                                   |
+   +===============================+===============+============================================================================================================+
+   | topology.acker.executors      | null          | Specifies the number of acker executors. If a service application has lower reliability requirements and certain data does not need to be processed, this parameter can be set to **null** or **0** to disable acker. Flow control is then weakened and message delay is not calculated, which improves performance. |
+   +-------------------------------+---------------+----------------------------------------------------------------------------------------------------------+
+   | topology.max.spout.pending    | null          | Specifies the number of messages cached by spout. The parameter value takes effect only when acker is not **0** or **null**. Spout adds each message sent to a downstream bolt to the pending queue, and the message is removed from the queue after the downstream bolt processes and acknowledges it. When the pending queue is full, spout stops sending messages. Increasing the pending value improves the message throughput of spout per second but prolongs the delay. |
+   +-------------------------------+---------------+----------------------------------------------------------------------------------------------------------+
+   | topology.transfer.buffer.size | 32            | Specifies the size of the Disruptor message queue for each worker process. It is recommended that the size be between 4 and 32. Increasing the queue size improves the throughput but may prolong the delay. |
+   +-------------------------------+---------------+----------------------------------------------------------------------------------------------------------+
+   | RES_CPUSET_PERCENTAGE         | 80            | Specifies the percentage of physical CPU resources used by the supervisor role instance (including the worker processes it starts and manages) on each node. Adjust the value based on the service volume of the node where the supervisor resides, to optimize CPU usage. |
+   +-------------------------------+---------------+----------------------------------------------------------------------------------------------------------+
+
+JVM Tuning
+----------
+
+If an application must occupy more memory resources to process a large volume of data and the size of worker memory is greater than 2 GB, the G1 garbage collection algorithm is recommended.
+
+.. 
table:: **Table 2** Tuning parameters + + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Default Value | Scenario | + +================+====================================================================================================================================================================================================================================================================+===========================================================================================================================================================================================================================================================================+ + | WORKER_GC_OPTS | -Xms1G -Xmx1G -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump | If an application must occupy more memory resources to process a large volume of data and the size of worker memory is greater than 2 GB, the G1 garbage collection algorithm is recommended. In this case, change the parameter value to **-Xms2G -Xmx5G -XX:+UseG1GC**. | + +----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_storm/querying_storm_topology_logs.rst b/doc/component-operation-guide/source/using_storm/querying_storm_topology_logs.rst new file mode 100644 index 0000000..e9d5223 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/querying_storm_topology_logs.rst @@ -0,0 +1,32 @@ +:original_name: mrs_01_0384.html + +.. _mrs_01_0384: + +Querying Storm Topology Logs +============================ + +Scenario +-------- + +You can query topology logs to check the execution of a Storm topology in a worker process. To query the data processing logs of a topology, enable the **Debug** function when submitting the topology. Only streaming clusters with Kerberos authentication enabled support this function. In addition, the user who queries topology logs must be the one who submits the topology or a member of the **stormadmin** group. + +Prerequisites +------------- + +- The network of the working environment has been configured. +- The sampling function has been enabled for the topology. + +Querying Worker Process Logs +---------------------------- + +#. For details about how to access the Storm WebUI, see :ref:`Accessing the Storm Web UI `. +#. 
In the **Topology Summary** area, click the desired topology to view details. +#. Click the desired **Spouts** or **Bolts** task. In the **Executors (All time)** area, click a port in **Port** to view detailed logs. + +Querying Data Processing Logs of a Topology +------------------------------------------- + +#. For details about how to access the Storm WebUI, see :ref:`Accessing the Storm Web UI `. +#. In the **Topology Summary** area, click the desired topology to view details. +#. Click **Debug**, specify the data sampling ratio, and click **OK**. +#. Click the **Spouts** or **Bolts** task. In **Component summary**, click **events** to view data processing logs. diff --git a/doc/component-operation-guide/source/using_storm/storm_common_parameters.rst b/doc/component-operation-guide/source/using_storm/storm_common_parameters.rst new file mode 100644 index 0000000..870a487 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/storm_common_parameters.rst @@ -0,0 +1,33 @@ +:original_name: mrs_01_1046.html + +.. _mrs_01_1046: + +Storm Common Parameters +======================= + +This section applies to MRS 3.\ *x* or later. + +Navigation Path +--------------- + +For details about how to set parameters, see :ref:`Modifying Cluster Service Configuration Parameters `. + +Parameter Description +--------------------- + +.. table:: **Table 1** Parameter description + + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default Value | + +========================+=========================================================================================================================================================================================================================================================================================================================================================+====================================================================================================================================================================================================================================================================+ + | supervisor.slots.ports | Specifies the list of ports that can run workers on the supervisor. Each worker occupies a port, and each port runs only one worker. This parameter is used to set the number of workers that can run on each server. Ports range from 1024 to 65535, and ports are separated by commas (,). 
| 6700,6701,6702,6703 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | WORKER_GC_OPTS | Specifies the JVM option used for supervisor to start worker. It is recommended that you set this parameter based on memory usage of a service. For simple service processing, the recommended value is **-Xmx1G**. If window cache is used, the value of this parameter is calculated based on the following formula: Size of each record x Period x 2 | -Xms1G -Xmx1G -XX:+UseG1GC -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | default.schedule.mode | Specifies the default scheduling mode of the scheduler. Options are as follows: | AVERAGE | + | | | | + | | - **AVERAGE**: indicates that the scheduling mechanism that uses the number of idle slots as the priority is used. | | + | | - **RATE**: indicates that the scheduling mechanism that uses the rate of idle slots as the priority is used. | | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | nimbus.thrift.threads | Set the maximum number of connection threads when the active Nimbus externally provides services. If the Storm cluster is large and the number of Supervisor instances is large, increase connection threads. 
| 512 | + +------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/doc/component-operation-guide/source/using_storm/storm_log_introduction.rst b/doc/component-operation-guide/source/using_storm/storm_log_introduction.rst new file mode 100644 index 0000000..76e2dd6 --- /dev/null +++ b/doc/component-operation-guide/source/using_storm/storm_log_introduction.rst @@ -0,0 +1,165 @@ +:original_name: mrs_01_1053.html + +.. _mrs_01_1053: + +Storm Log Introduction +====================== + +This section applies to MRS 3.\ *x* or later. + +Log Description +--------------- + +Log paths: The default paths of Storm log files are **/var/log/Bigdata/storm/**\ *Role name* (run logs) and **/var/log/Bigdata/audit/storm/**\ *Role name* (audit logs). + +- Nimbus: **/var/log/Bigdata/storm/nimbus** (run logs) and **/var/log/Bigdata/audit/storm/nimbus** (audit logs) +- Supervisor: **/var/log/Bigdata/storm/supervisor** (run logs) and **/var/log/Bigdata/audit/storm/supervisor** (audit logs) +- UI: **/var/log/Bigdata/storm/ui** (run logs) and **/var/log/Bigdata/audit/storm/ui** (audit logs) +- Logviewer: **/var/log/Bigdata/storm/logviewer** (run logs) and **/var/log/Bigdata/audit/storm/logviewer** (audit logs) + +Log archive rule: The automatic Storm log compression function is enabled. By default, when the size of logs exceeds 10 MB, logs are automatically compressed into a log file named in the following format: **\ **.log.**\ *[ID]*\ **.gz**. A maximum of 20 latest compressed files are reserved by default. You can configure the number of compressed files and the compression threshold. + +Names of compressed audit log files are in the format of **audit.log.**\ *[yyyy-MM-dd]*\ **.**\ *[ID]*\ **.zip**. These files permanently exist. + +.. 
table:: **Table 1** Storm log list + + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Log File Name | Description | + +===========+==============================================================+===========================================================================================================================================================================+ + | Run log | nimbus/access.log | Nimbus user access log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/nimbus--gc.log | GC log of the Nimbus process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/checkavailable.log | Nimbus availability check log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/checkService.log | Nimbus serviceability check log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/metrics.log | Nimbus monitoring statistics log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/nimbus.log | Run log of the Nimbus process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/postinstall.log | Work log after Nimbus installation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/prestart.log | Work log before Nimbus startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/start.log | Work log of Nimbus startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/stop.log | Work log of Nimbus shutdown | + 
+-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/access.log | Supervisor access log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/metrics.log | Supervisor monitoring statistics log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/postinstall.log | Work log after supervisor installation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/prestart.log | Work log before supervisor startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/start.log | Work log of supervisor startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/stop.log | Work log of supervisor shutdown | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/supervisor.log | Run log of the supervisor process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/supervisor--gc.log | GC log of the supervisor process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/access.log | UI access log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/metric.log | UI monitoring statistics log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/ui--gc.log | GC log of the UI process | + 
+-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/postinstall.log | Work log after UI installation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/prestart.log | Work log before UI startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/start.log | Work log of UI startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/stop.log | Work log of UI shutdown | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/ui.log | Run log of the UI process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/access.log | Logviewer access log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/metric.log | Logviewer monitoring statistics log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/logviewer--gc.log | GC log file of the logviewer process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/logviewer.log | Run log of the logviewer process | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/postinstall.log | Work log after logviewer installation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/prestart.log | Work log before logviewer startup | + 
+-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/start.log | Work log of logviewer startup | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/stop.log | Work log of logviewer shutdown | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/[topologyId]-worker-[*Port number*].log | Run log of the Worker process. One port occupies one log file. By default, the system contains five ports: 29100, 29101, 29102, 29103 and 29304. | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/metadata/[topologyid]-worker-[*Port number*].yaml | Worker log metadata file, which is used by logviewer to delete logs. This file is automatically deleted by the logviewer log deletion thread based on certain conditions. | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | nimbus/cleanup.log | Cleanup log of Nimbus uninstallation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/cleanup.log | Cleanup log of logviewer uninstallation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/cleanup.log | Cleanup log of UI uninstallation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/cleanup.log | Cleanup log of supervisor uninstallation | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | leader_switch.log | Run log file that records the Storm active/standby switchover | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | nimbus/audit.log | Nimbus audit log | + 
+-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | ui/audit.log | UI audit log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | supervisor/audit.log | Supervisor audit log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | logviewer/audit | Logviewer audit log | + +-----------+--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +Log Levels +---------- + +:ref:`Table 2 ` describes the log levels supported by Storm. + +Levels of run logs and audit logs are ERROR, WARN, INFO, and DEBUG from the highest to the lowest priority. Run logs of equal or higher levels are recorded. The higher the specified log level, the fewer the logs recorded. + +.. _mrs_01_1053__t9e96567fde95476e83e7a7b9cce3954e: + +.. table:: **Table 2** Log levels + + +-------+------------------------------------------------------------------------------------------+ + | Level | Description | + +=======+==========================================================================================+ + | ERROR | Logs of this level record error information about system running. | + +-------+------------------------------------------------------------------------------------------+ + | WARN | Logs of this level record exception information about the current event processing. | + +-------+------------------------------------------------------------------------------------------+ + | INFO | Logs of this level record normal running status information about the system and events. | + +-------+------------------------------------------------------------------------------------------+ + | DEBUG | Logs of this level record the system information and system debugging information. | + +-------+------------------------------------------------------------------------------------------+ + +To modify log levels, perform the following operations: + +#. Go to the **All Configurations** page of Storm by referring to :ref:`Modifying Cluster Service Configuration Parameters `. +#. On the menu bar on the left, select the log menu of the target role. +#. Select a desired log level. +#. Save the configuration. In the displayed dialog box, click **OK** to make the configurations take effect. + +Log Format +---------- + +The following table lists the Storm log formats: + +.. 
table:: **Table 3** Log Formats + + +-----------+------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Log Type | Format | Example | + +===========+====================================================================================+=============================================================================================================================================================================================================================================================+ + | Run log | %d{yyyy-MM-dd HH:mm:ss,SSS} \| %-5p \| [%t] \| %m \| %logger (%F:%L) %n | 2015-03-11 23:04:00,241 \| INFO \| [RMI TCP Connection(2646)-10.0.0.2] \| The baseSleepTimeMs [1000] the maxSleepTimeMs [1000] the maxRetries [1] \| backtype.storm.utils.StormBoundedExponentialBackoffRetry (StormBoundedExponentialBackoffRetry.java:46) | + +-----------+------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | | | 2017-03-28 02:57:52 493 10-5-146-1 storm- INFO Nimbus start normally | + +-----------+------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Audit log | *